This volume contains 3 invited papers, 15 regular papers, and 22 poster papers
that were selected for presentation at the Third International Conference on 
Discovery Science (DS 2000), which was held 4-6 December 2000 in Kyoto. The 
Program Committee selected the contributed papers from 48 submissions. 

Three distinguished researchers accepted our invitation to present talks:
Jeffrey D. Ullman (Stanford University), Joseph Y. Halpern (Cornell University),
and Masami Hagiya (University of Tokyo).

The Program Committee would like to thank all those who submitted papers 
for consideration and the invited speakers. I would like to thank the Program 
Committee members, the Local Arrangements Committee members, and the 
Steering Committee members for their splendid and hard work. Finally, special 
thanks go to the PC Assistant Shoko Suzuki for her assistance in the development 
of web pages and the preparation of these proceedings. 
A Survey of Association-Rule Mining



Jeffrey D. Ullman 

Stanford University, Stanford CA 94305 USA 



Abstract. The standard model for association-rule mining involves a set
of “items” and a set of “baskets.” The baskets contain items that some
customer has purchased at the same time. The problem is to find pairs, or
perhaps larger sets, of items that frequently appear together in baskets.
We mention the principal approaches to efficient, large-scale discovery
of the frequent itemsets, including the a-priori algorithm, improvements
using hashing, and one- and two-pass probabilistic algorithms for finding
frequent itemsets. We then turn to techniques for finding highly correlated,
but infrequent, pairs of items. These notes were written for CS345
at Stanford University and are reprinted by permission of the author.
http://www-db.stanford.edu/~ullman/mining/mining.html gives you
access to the entire set of notes, including additional citations and on-line
links.



1 Association Rules and Frequent Itemsets 

The market-basket problem assumes we have some large number of items, e.g., 
“bread” or “milk.” Customers fill their market baskets with some subset of the 
items, and we get to know what items people buy together, even if we don’t know 
who they are. Marketers use this information to position items, and control the 
way a typical customer traverses the store. 

In addition to the marketing application, the same sort of question has the 
following uses: 

1. Baskets = documents; items = words. Words appearing frequently together 
in documents may represent phrases or linked concepts. One possible
application is intelligence gathering.

2. Baskets = sentences, items = documents. Two documents with many of the 
same sentences could represent plagiarism or mirror sites on the Web. 

1.1 Goals for Market-Basket Mining 

1. Association rules are statements of the form $\{X_1, X_2, \ldots, X_n\} \Rightarrow Y$,
meaning that if we find all of $X_1, X_2, \ldots, X_n$ in the market basket, then we have
a good chance of finding $Y$. The probability of finding $Y$ given $\{X_1, \ldots, X_n\}$
is called the confidence of the rule. We normally would accept only rules that
had confidence above a certain threshold. We may also ask that the confidence
be significantly higher than it would be if items were placed at random

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 1-14, 2000. 
into baskets. For example, we might find a rule like {milk, butter} => bread
simply because a lot of people buy bread. However, the beer/diapers story¹
asserts that the rule {diapers} => beer holds with confidence significantly
greater than the fraction of baskets that contain beer.

2. Causality. Ideally, we would like to know that in an association rule the presence
of $X_1, \ldots, X_n$ actually “causes” $Y$ to be bought. However, “causality”
is an elusive concept. Nevertheless, for market-basket data, the following test
suggests what causality means. If we lower the price of diapers and raise the 
price of beer, we can lure diaper buyers, who are more likely to pick up beer 
while in the store, thus covering our losses on the diapers. That strategy 
works because “diapers causes beer.” However, working it the other way 
round, running a sale on beer and raising the price of diapers, will not result 
in beer buyers buying diapers in any great numbers, and we lose money. 

3. Frequent itemsets. In many (but not all) situations, we only care about
association rules or causalities involving sets of items that appear frequently
in baskets. For example, we cannot run a good marketing strategy involving
items that almost no one buys anyway. Thus, much data mining starts with
the assumption that we only care about sets of items with high support,
i.e., they appear together in many baskets. We then find association rules or
causalities only involving a high-support set of items (i.e., $\{X_1, \ldots, X_n, Y\}$
must appear in at least a certain percent of the baskets, called the support
threshold).
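The definitions of support and confidence above can be made concrete with a minimal sketch. The function names `support` and `confidence` and the toy `BASKETS` data are illustrative assumptions, not part of the original notes.

```python
from typing import FrozenSet, Iterable, List

Basket = FrozenSet[str]


def support(itemset: Iterable[str], baskets: List[Basket]) -> float:
    """Fraction of baskets that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(lhs: Iterable[str], rhs: str, baskets: List[Basket]) -> float:
    """Fraction of the baskets containing all of `lhs` that also contain `rhs`."""
    lhs_set = frozenset(lhs)
    with_lhs = [b for b in baskets if lhs_set <= b]
    if not with_lhs:
        return 0.0  # rule never applies; treated as confidence 0 in this sketch
    return sum(1 for b in with_lhs if rhs in b) / len(with_lhs)


# Toy data, invented purely for illustration.
BASKETS: List[Basket] = [
    frozenset({"diapers", "beer", "bread"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"beer"}),
]

if __name__ == "__main__":
    print(confidence(["diapers"], "beer", BASKETS))  # confidence of {diapers} => beer
    print(support(["beer"], BASKETS))                # fraction of baskets with beer
```

In this toy data the rule {diapers} => beer has confidence 2/3, which is higher than the 1/2 of all baskets that contain beer, so it would pass the "significantly higher than random" test described in goal 1.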

1.2 Framework for Frequent Itemset Mining 

We use the term frequent itemset for “a set S that appears in at least fraction s 
of the baskets,” where s is some chosen constant, typically 0.01 or 1%. 

We assume data is too large to fit in main memory. Either it is stored in a
relational database, say as a relation Baskets(BID, item), or as a flat file of records
of the form (BID, item1, item2, ..., itemn). When evaluating the running time
of algorithms we:

— Count the number of passes through the data. Since the principal cost is 
often the time it takes to read data from disk, the number of times we 
need to read each datum is often the best measure of running time of the 
algorithm. 

There is a key principle, called monotonicity or the a-priori trick, that helps
us find frequent itemsets:

— If a set of items S is frequent (i.e., appears in at least fraction s of the 
baskets), then every subset of S is also frequent. 

¹ The famous, and possibly apocryphal, discovery that people who buy diapers are
unusually likely to buy beer.

— Put in the contrapositive: a set S cannot be frequent unless all its subsets
are. 

To find frequent itemsets, we can: 

1. Proceed levelwise, finding first the frequent items (sets of size 1), then the
frequent pairs, the frequent triples, etc. In our discussion, we concentrate on
finding frequent pairs because:

(a) Often, pairs are enough.

(b) In many data sets, the hardest part is finding the pairs; proceeding to
higher levels takes less time than finding frequent pairs.

Levelwise algorithms use one pass per level.

2. Find all maximal frequent itemsets (i.e., sets S of any size, such that no
proper superset of S is frequent) in one pass or a few passes.

1.3 The A-Priori Algorithm 

The following is taken from [1], [2]. The algorithm called a-priori proceeds
levelwise.

1. Given support threshold s, in the first pass we find the items that appear in
at least fraction s of the baskets. This set is called $L_1$, the frequent items.
Presumably there is enough main memory to count occurrences of each item,
since a typical store sells no more than 100,000 different items.

2. Pairs of items in $L_1$ become the candidate pairs $C_2$ for the second pass. We
hope that the size of $C_2$ is not so large that there is not room in main memory
for an integer count per candidate pair. The pairs in $C_2$ whose count reaches
s are the frequent pairs, $L_2$.

3. The candidate triples, $C_3$, are those sets {A, B, C} such that all of {A, B},
{A, C}, and {B, C} are in $L_2$. On the third pass, count the occurrences of
triples in $C_3$; those with a count of at least s are the frequent triples, $L_3$.

4. Proceed as far as you like (or until the sets become empty). $L_i$ is the frequent sets
of size i; $C_{i+1}$ is the set of sets of size i + 1 such that each subset of size i is
in $L_i$.
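The levelwise procedure in steps 1 to 4 can be sketched in a few lines. This is a minimal in-memory illustration, not the disk-based algorithm of [1], [2]: the function name `apriori`, the representation of baskets as Python sets, and the brute-force candidate generation are all assumptions made for brevity.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(baskets: List[Set[str]], s: float) -> Dict[int, Set[FrozenSet[str]]]:
    """Return {i: L_i}: the frequent itemsets of each size i, for support fraction s."""
    min_count = s * len(baskets)

    # Pass 1: count single items to obtain L_1.
    item_counts = Counter(item for b in baskets for item in set(b))
    current = {frozenset([i]) for i, c in item_counts.items() if c >= min_count}

    levels: Dict[int, Set[FrozenSet[str]]] = {}
    size = 1
    while current:
        levels[size] = current
        size += 1
        # C_size: sets of `size` items all of whose (size - 1)-subsets are in L_{size-1}.
        items = sorted({i for fs in current for i in fs})
        candidates = {
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        # One pass over the data per level, counting only the candidates.
        counts: Counter = Counter()
        for b in baskets:
            for c in candidates:
                if c <= b:
                    counts[c] += 1
        current = {c for c, n in counts.items() if n >= min_count}
    return levels


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
    for size, sets in apriori(data, 0.5).items():
        print(size, sorted(sorted(x) for x in sets))
```

Enumerating all combinations of frequent items is only reasonable for tiny examples; the point of the sketch is the monotonicity filter and the one-pass-per-level structure.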

1.4 Why A-Priori Helps 

Consider the following SQL on a Baskets(BID, item) relation with $10^8$ tuples
involving $10^7$ baskets of 10 items each; assume 100,000 different items (typical
of Wal-Mart, e.g.).

SELECT b1.item, b2.item, COUNT(*)
FROM Baskets b1, Baskets b2
WHERE b1.BID = b2.BID AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;

Note: s is the support threshold, and the second term of the WHERE clause is 
to prevent pairs of items that are really one item, and to prevent pairs from 
appearing twice. 

In the join Baskets ⋈ Baskets, each basket contributes $\binom{10}{2} = 45$ pairs, so
the join has $4.5 \times 10^8$ tuples.

A-priori “pushes the HAVING down the expression tree,” causing us first to
replace Baskets by the result of

SELECT *
FROM Baskets
GROUP BY item
HAVING COUNT(*) >= s;

If s = 0.01, then at most 1000 items’ groups can pass the HAVING condition.
Reason: there are $10^8$ item occurrences, and an item needs $0.01 \times 10^7 = 10^5$ of
those to appear in 1% of the baskets.

— Although 99% of the items are thrown away by a-priori, we should not assume
the resulting Baskets relation has only $10^6$ tuples. In fact, all the tuples
may be for the high-support items. However, in real situations, the shrinkage
in Baskets is substantial, and the size of the join shrinks in proportion to
the square of the shrinkage in Baskets.
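The effect of pushing the HAVING condition below the join can be tried out with an in-memory SQLite database. The sketch below is an assumption-laden illustration: it expresses the item filter as a common table expression named `Frequent` (a name not in the original), and it takes the threshold as an absolute count `min_count` rather than as a fraction s.

```python
import sqlite3

# Pair query with the frequent-item filter applied before the self-join result is grouped.
PAIR_QUERY = """
WITH Frequent AS (
    SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= :s
)
SELECT b1.item, b2.item, COUNT(*)
FROM Baskets b1, Baskets b2
WHERE b1.BID = b2.BID AND b1.item < b2.item
  AND b1.item IN (SELECT item FROM Frequent)
  AND b2.item IN (SELECT item FROM Frequent)
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= :s
ORDER BY b1.item, b2.item
"""


def frequent_pairs(rows, min_count):
    """rows: iterable of (BID, item) tuples; returns [(item1, item2, count), ...]."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Baskets (BID INTEGER, item TEXT)")
        conn.executemany("INSERT INTO Baskets VALUES (?, ?)", rows)
        return conn.execute(PAIR_QUERY, {"s": min_count}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    toy_rows = [
        (1, "a"), (1, "b"), (1, "c"),
        (2, "a"), (2, "b"),
        (3, "a"), (3, "c"),
        (4, "b"), (4, "d"),
    ]
    print(frequent_pairs(toy_rows, 2))
```

Whether a real query optimizer evaluates the filter before the join depends on the database system; the sketch only shows that the filtered query returns the same frequent pairs.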

1.5 Improvements to A-Priori 

Two types: 

1. Cut down the size of the candidate sets $C_i$ for i ≥ 2. This option is important,
even for finding frequent pairs, since the number of candidates must be
sufficiently small that a count for each can fit in main memory.

2. Merge the attempts to find $L_1, L_2, L_3, \ldots$ into one or two passes, rather than
a pass per level.

1.6 PCY Algorithm 

Park, Chen, and Yu [5] proposed using a hash table to determine on the first
pass (while $L_1$ is being determined) that many pairs are not possibly frequent.
PCY takes advantage of the fact that main memory is usually much bigger than
the number of items. During the two passes to find $L_2$, the main memory is laid
out as in Fig. 1.

Assume that data is stored as a flat file, with records consisting of a basket
ID and a list of its items.

Fig. 1. Two passes of the PCY algorithm



1. Pass 1: 

(a) Count occurrences of all items. 

(b) For each basket, consisting of items $\{i_1, \ldots, i_k\}$, hash each pair to a
bucket of the hash table, and increment the count of the bucket by 1.

(c) At the end of the pass, determine $L_1$, the items with counts at least s.

(d) Also at the end, determine those buckets with counts at least s.

— Key point: a pair (i, j) cannot be frequent unless it hashes to a
frequent bucket, so pairs that hash to other buckets need not be
candidates in $C_2$.

Replace the hash table by a bitmap, with one bit per bucket: 1 if the 
bucket was frequent, 0 if not. 

2. Pass 2: 

(a) Main memory holds a list of all the frequent items, i.e., $L_1$.

(b) Main memory also holds the bitmap summarizing the results of the hash- 
ing from pass 1. 

— Key point: The buckets must use 16 or 32 bits for a count, but these 
are compressed to 1 bit. Thus, even if the hash table occupied almost 
the entire main memory on pass 1, its bitmap occupies no more than 
1/16 of main memory on pass 2. 

(c) Finally, main memory also holds a table with all the candidate pairs and
their counts. A pair (i, j) can be a candidate in $C_2$ only if all of the
following are true:

i. i is in $L_1$.

ii. j is in $L_1$.

iii. (i, j) hashes to a frequent bucket.

It is the last condition that distinguishes PCY from straight a-priori and 
reduces the requirements for memory in pass 2. 

(d) During pass 2, we consider each basket, and each pair of its items, making 
the test outlined above. If a pair meets all three conditions, add to its 
count in memory, or create an entry for it if one does not yet exist. 

— When does PCY beat a-priori? When there are too many pairs of items from
$L_1$ to fit a table of candidate pairs and their counts in main memory, yet the
number of frequent buckets in the PCY algorithm is sufficiently small that
it reduces the size of $C_2$ below what can fit in memory (even with 1/16 of it
given over to the bitmap).

— When will most of the buckets be infrequent in PCY? When there are a few 
frequent pairs, but most pairs are so infrequent that even when the counts 
of all the pairs that hash to a given bucket are added, they still are unlikely 
to sum to s or more. 
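The two PCY passes described above can be sketched as follows. This is an in-memory illustration under stated assumptions: the helper `bucket_of` uses CRC-32 as an arbitrary deterministic hash, `min_count` is an absolute count rather than a fraction, and the "bitmap" is a plain list of booleans.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Deterministic hash of an (i, j) pair to a bucket number (illustrative choice)."""
    return zlib.crc32(f"{pair[0]}|{pair[1]}".encode()) % num_buckets


def pcy_frequent_pairs(baskets, min_count, num_buckets):
    """Return {(i, j): count} for pairs occurring in at least `min_count` baskets."""
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: count items and hash every pair of every basket to a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(b, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    # Between passes: compress the bucket counts to one bit per bucket.
    bitmap = [c >= min_count for c in bucket_counts]

    # Pass 2: count only pairs that satisfy all three candidate conditions.
    pair_counts = Counter()
    for b in baskets:
        for i, j in combinations(b, 2):
            if (i in frequent_items and j in frequent_items
                    and bitmap[bucket_of((i, j), num_buckets)]):
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_count}


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
    print(pcy_frequent_pairs(data, 2, 7))
```

The result does not depend on the hash function or the number of buckets, because a frequent pair always lands in a frequent bucket; only the number of candidates counted in pass 2 changes.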



1.7 The “Iceberg” Extensions to PCY 

The following is taken from [4]. 

1. Multiple hash tables: share memory between two or more hash tables on pass
1, as in Fig. 2. On pass 2, a bitmap is stored for each hash table; note that
the space needed for all these bitmaps is exactly the same as what is needed
for the one bitmap in PCY, since the total number of buckets represented is
the same. In order to be a candidate in $C_2$, a pair must:

(a) Consist of items from $L_1$, and

(b) Hash to a frequent bucket in every hash table.

2. Iterated hash tables (Multistage): Instead of checking candidates in pass 2,
we run another hash table (different hash function!) in pass 2, but we only
hash those pairs that meet the test of PCY; i.e., they are both from $L_1$ and
hashed to a frequent bucket on pass 1. On the third pass, we keep bitmaps
from both hash tables, and treat a pair as a candidate in $C_2$ only if:

(a) Both items are in $L_1$.

(b) The pair hashed to a frequent bucket on pass 1.

(c) The pair also was hashed to a frequent bucket on pass 2.

Fig. 3. Multistage hash tables memory utilization



Figure 3 suggests the use of memory. This scheme could be extended to more 
passes, but there is a limit, because eventually the memory becomes full of 
bitmaps, and we can’t count any candidates. 



When do multiple hash tables help? When most buckets on the first pass
of PCY have counts far below the threshold s. Then, we can double the
counts in buckets and still have most buckets below threshold.

When does multistage help? When the number of frequent buckets on the
first pass is high (e.g., 50%), but not all buckets. Then, a second hashing
with some of the pairs ignored may reduce the number of frequent buckets
significantly.
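The multistage (iterated hash table) variant can be sketched along the same lines as PCY. As before, this is a small in-memory illustration: the helper `_bucket`, its `salt` argument for obtaining two different hash functions, and the absolute threshold `min_count` are assumptions of the sketch.

```python
import zlib
from collections import Counter
from itertools import combinations


def _bucket(pair, salt, num_buckets):
    """Salted CRC-32 hash; different salts act as different hash functions."""
    key = f"{salt}:{pair[0]}|{pair[1]}".encode()
    return zlib.crc32(key) % num_buckets


def multistage_frequent_pairs(baskets, min_count, num_buckets):
    """Three-pass multistage sketch; returns {(i, j): count} for frequent pairs."""
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: item counts and the first hash table, exactly as in PCY.
    item_counts = Counter(i for b in baskets for i in b)
    table1 = [0] * num_buckets
    for b in baskets:
        for pair in combinations(b, 2):
            table1[_bucket(pair, 1, num_buckets)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    bitmap1 = [c >= min_count for c in table1]

    def passes_pcy_test(pair):
        return (pair[0] in frequent_items and pair[1] in frequent_items
                and bitmap1[_bucket(pair, 1, num_buckets)])

    # Pass 2: a second hash table, fed only with pairs that pass the PCY test.
    table2 = [0] * num_buckets
    for b in baskets:
        for pair in combinations(b, 2):
            if passes_pcy_test(pair):
                table2[_bucket(pair, 2, num_buckets)] += 1
    bitmap2 = [c >= min_count for c in table2]

    # Pass 3: count pairs that satisfy conditions (a), (b), and (c).
    counts = Counter()
    for b in baskets:
        for pair in combinations(b, 2):
            if passes_pcy_test(pair) and bitmap2[_bucket(pair, 2, num_buckets)]:
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
    print(multistage_frequent_pairs(data, 2, 5))
```

A truly frequent pair passes both bitmap tests by construction, so the extra pass can only remove false candidates, never frequent pairs.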



1.8 All Frequent Itemsets in Two Passes 

The methods above are best when you only want frequent pairs, a common case. 
If we want all maximal frequent itemsets, including large sets, too many passes 
may be needed. There are several approaches to getting all frequent itemsets in 
two passes or less. They each rely on randomness of data in some way. 

1. Simple approach: Take a main-memory-sized sample of the data. Run a levelwise
algorithm in main memory (so you don’t have to pay for disk I/O),
and hope that the sample will give you the truly frequent sets.

— Note that you must scale the threshold s back; e.g., if your sample is 1% 
of the data, use s/100 as your support threshold. 

— You can make a complete pass through the data to verify that the frequent
itemsets of the sample are truly frequent, but you will miss a set
that is frequent in the whole data but not in the sample.

— To minimize false negatives, you can lower the threshold a bit in the
sample, thus finding more candidates for the full pass through the data.
Risk: you will have too many candidates to fit in main memory.

2. The SON approach [6]: Read subsets of the data into main memory, and
apply the “simple approach” to discover candidate sets. Every basket is part
of one such main-memory subset. On the second pass, a set is a candidate if
it was identified as a candidate in any one or more of the subsets.

— Key point: A set cannot be frequent in the entire data unless it is frequent 
in at least one subset. 

3. Toivonen’s Algorithm [7]: 

(a) Take a sample that fits in main memory. Run the simple approach on
this data, but with a threshold lowered so that we are unlikely to miss
any truly frequent itemsets (e.g., if sample is 1% of the data, use s/125
as the support threshold).

(b) Add to the candidates of the sample the negative border: those sets of
items S such that S is not identified as frequent in the sample, but
every immediate subset of S is. For example, if ABCD is not frequent
in the sample, but all of ABC, ABD, ACD, and BCD are frequent in
the sample, then ABCD is in the negative border.

(c) Make a pass over the data, counting all the candidate itemsets and the
negative border. If no member of the negative border is frequent in the
full data, then the frequent itemsets are exactly those candidates that
are above threshold.

(d) Unfortunately, if there is a member of the negative border that turns out
to be frequent, then we don’t know whether some of its supersets are also
frequent, so the whole process needs to be repeated (or we accept what
we have and don’t worry about a few false negatives).
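The negative border of step (b) can be computed directly from the itemsets found frequent in the sample. The sketch below is an illustration only; the function name `negative_border` is hypothetical, and it assumes the convention that the empty set counts as frequent, so that an infrequent single item belongs to the border.

```python
def negative_border(frequent, items):
    """Sets that are not in `frequent` but whose immediate subsets all are.

    `frequent` is a collection of itemsets found frequent in the sample;
    `items` is the universe of items. Assumption of this sketch: the empty
    set is treated as frequent, so infrequent single items are in the border.
    """
    frequent = {frozenset(f) for f in frequent}
    border = set()

    # Size-1 sets: the only immediate subset is the empty set.
    for item in items:
        if frozenset([item]) not in frequent:
            border.add(frozenset([item]))

    # Larger sets: every border set is some frequent set plus one extra item.
    for f in frequent:
        for item in items:
            if item in f:
                continue
            candidate = f | {item}
            if candidate in frequent:
                continue
            if all(candidate - {x} in frequent for x in candidate):
                border.add(candidate)
    return border


if __name__ == "__main__":
    from itertools import combinations

    # The example from the text: every proper non-empty subset of ABCD is frequent.
    sample_frequent = [c for k in (1, 2, 3) for c in combinations("ABCD", k)]
    print(negative_border(sample_frequent, "ABCD"))
```

During the full pass, the counts of these border sets are what tell us whether the sample missed anything.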

2 Low-Support, High-Correlation Mining 

The following material is taken from [3]. We continue to assume a “market- 
basket” model for data, and we visualize the data as a boolean matrix, where 
rows = baskets and columns = items. Key assumptions: 

1. Matrix is very sparse; almost all 0’s.

2. The number of columns (items) is sufficiently small that we can store something
per column in main memory, but sufficiently large that we cannot store
something per pair of items in main memory (the same assumption we’ve
made in all association-rule work so far).

3. The number of rows is so large that we cannot store the entire matrix in
memory, even if we take advantage of sparseness and compress (again, same
assumption as always).

4. We are not interested in high-support pairs or sets of columns; rather we 
want highly correlated pairs of columns. 

2.1 Applications 

While marketing applications generally care only about high support (it doesn’t
pay to try to market things that nobody buys anyway), there are several applications
that meet the model above, especially the point about pairs of columns
or items with low support but high correlation being interesting:

1. Rows and columns are Web pages; (r, c) = 1 means that the page of row r 
links to the page of column c. Similar columns may be pages about the same 
topic. 

2. Same as (1), but the page of column c links to the page of row r. Now, similar 
columns may represent mirror pages. 

3. Rows = sentences of Web pages or documents; columns = words appearing 
in those sentences. Similar columns are words that appear almost always 
together, e.g., “phrases.” 

4. Rows = sentences; columns = Web pages or documents containing those
sentences. Similar columns may indicate mirror pages or plagiarisms.

2.2 Similarity

Think of a column as the set of rows in which the column has a 1. Then the
similarity of two columns $C_1$ and $C_2$ is $Sim(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$.

Example 1.

C1  C2
0   1
1   0
1   1
1   1
0   1

Both columns have a 1 in 2 rows, and at least one column has a 1 in 5 rows,
so the columns are 2/5 = 40% similar.
2.3 Signatures 

Key idea: map (“hash”) each column C to a small amount of data [the signature,
Sig(C)] such that:

1. Sig(C) is small enough that a signature for each column can be fit in main
memory.

2. Columns $C_1$ and $C_2$ are highly similar if and only if $Sig(C_1)$ and $Sig(C_2)$ are
highly similar. (But note that we need to define “similarity” for signatures.)

An idea that doesn’t work: Pick 100 rows at random, and make that string 
of 100 bits be the signature for each column. The reason is that the matrix is 
assumed sparse, so many columns will have an all-0 signature even if they are 
quite dissimilar. 

Useful convention: given two columns $C_1$ and $C_2$, we’ll refer to rows as being
of four types (a, b, c, d) depending on their bits in these columns, as follows:

Type  C1  C2
a     1   1
b     1   0
c     0   1
d     0   0

We’ll also use a as “the number of rows of type a,” and so on.

— Note, $Sim(C_1, C_2) = a/(a + b + c)$.

— But since most rows are of type d, a selection of, say, 100 random rows will
be all of type d, so the similarity of the columns in these 100 rows is not
even defined.

2.4 Min Hashing

Imagine the rows permuted randomly in order. “Hash” each column C to h(C),
the number of the first row in which column C has a 1.

— The probability that $h(C_1) = h(C_2)$ is $a/(a + b + c)$, since the hash values
agree if the first row with a 1 in either column is of type a, and they disagree
if the first such row is of type b or c. Note this probability is the same as
$Sim(C_1, C_2)$.

— If we repeat the experiment, with a new permutation of rows a large number 
of times, say 100, we get a signature consisting of 100 row numbers for each 
column. The “similarity” of these lists (fraction of positions in which they 
agree) will be very close to the similarity of the columns. 

— Important trick: we don’t actually permute the rows, which would take many 
passes over the entire data. Rather, we read the rows in whatever order they 
are given, and hash each row using (say) 100 different hash functions. For 
each column we maintain the lowest hash value of a row in which that column 
has a 1, independently for each of the 100 hash functions. After considering 
all rows, we shall have for each column the first rows in which the column
has 1, if the rows had been permuted in the orders given by each of the 100 
hash functions. 
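The "important trick" above, one pass over the rows with many hash functions and a running minimum per column, can be sketched as follows. The linear hash family `(a * r + b) mod p`, the function names, and the sparse row format are assumptions of this sketch, not part of the original notes.

```python
import random


def minhash_signatures(rows, num_cols, num_hashes, seed=0):
    """One pass over `rows`, each given as (row_number, [columns having a 1]).

    Returns one signature (list of `num_hashes` minimum hash values) per column.
    Assumes row numbers are non-negative integers smaller than the prime below.
    """
    rng = random.Random(seed)
    prime = 2_147_483_647  # 2^31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    infinity = float("inf")
    signatures = [[infinity] * num_hashes for _ in range(num_cols)]

    for r, cols in rows:
        # Hash the row number once per hash function.
        hashes = [(a * r + b) % prime for a, b in params]
        for c in cols:
            sig = signatures[c]
            for k, h in enumerate(hashes):
                if h < sig[k]:
                    sig[k] = h  # keep the lowest hash value seen for this column
    return signatures


def signature_similarity(sig1, sig2):
    """Fraction of positions in which two signatures agree."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)


if __name__ == "__main__":
    toy_rows = [(0, [0, 1]), (1, [0, 1]), (2, [2]), (3, [0, 1, 3]), (4, [2])]
    sigs = minhash_signatures(toy_rows, num_cols=4, num_hashes=100)
    print(signature_similarity(sigs[0], sigs[3]))  # estimate of Sim(C0, C3)
```

A column with no 1's keeps the placeholder value infinity in every position, so such columns should be excluded before comparing signatures.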

2.5 Locality-Sensitive Hashing 

Problem: we’ve got signatures for all the columns in main memory, and similar
signatures mean similar columns, with high probability, but there still may
be so many columns that doing anything that is quadratic in the number of
columns, even in main memory, is prohibitive. Locality-sensitive hashing (LSH)
is a technique to be used in main memory for approximating the set of similar
column-pairs with a lot less than quadratic work.

The goal: in time proportional to the number of columns, rule out the vast
majority of the column pairs as possible similar pairs.

1. Think of the signatures as columns of integers.

2. Partition the rows of the signatures into bands, say l bands of r rows each.

3. Hash the columns in each band into buckets. A pair of columns is a
candidate-pair if they hash to the same bucket in any band.

4. After identifying candidates, verify each candidate-pair $(C_i, C_j)$ by examining
$Sig(C_i)$ and $Sig(C_j)$ for similarity.

Example 2. To see the effect of LSH, consider data with 100,000 columns, and
signatures consisting of 100 integers each. The signatures take 40 MB of memory,
not too much by today’s standards. Suppose we want pairs that are 80% similar,
and we divide each signature into 20 bands of 5 integers each.
We’ll look at the signatures, rather than the columns, so we are really identifying
columns whose signatures are 80% similar, which is not quite the same thing.

— If two columns are 80% similar, then the probability that they are identical
in any one band of 5 integers is $(0.8)^5 = 0.328$. The probability that they
are not similar in any of the 20 bands is $(1 - 0.328)^{20} = 0.00035$. Thus, all but
about 1/3000 of the pairs with 80%-similar signatures will be identified as
candidates.

— Now, suppose two columns are only 40% similar. Then the probability that
they are identical in one band is $(0.4)^5 = 0.01$, and the probability that they
are similar in at least one of the 20 bands is no more than 0.2. Thus, we can
skip at least 4/5 of the pairs that will turn out not to be candidates, if 40%
is the typical similarity of columns.

— In fact, most pairs of columns will be a lot less than 40% similar, so we really 
eliminate a huge fraction of the dissimilar columns. 
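The arithmetic of Example 2 and the banding step itself can be sketched together. In this illustration the function names are hypothetical, signature positions are treated as independent when computing probabilities, and the "bucket" of a band is simply the tuple of its integers, so only bands that are exactly identical collide.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(sim, rows_per_band, bands):
    """Probability that two signatures agreeing in fraction `sim` of their
    positions are identical in at least one band."""
    return 1.0 - (1.0 - sim ** rows_per_band) ** bands


def lsh_candidates(signatures, rows_per_band):
    """Pairs of column indices whose signatures are identical in some band."""
    candidates = set()
    length = len(signatures[0])
    for start in range(0, length, rows_per_band):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            # The band's integers serve directly as the bucket key.
            buckets[tuple(sig[start:start + rows_per_band])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


if __name__ == "__main__":
    # 100 integers per signature, split into 20 bands of 5 rows each.
    print(candidate_probability(0.8, 5, 20))
    print(candidate_probability(0.4, 5, 20))
    print(lsh_candidates([[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]], 2))
```

Each column is touched once per band, so the work is proportional to the number of columns times the number of bands, plus the number of candidate pairs produced.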



2.6 k-Min Hashing

Min hashing requires that we hash each row number k times, if we want a
signature of k integers. In k-min hashing we instead hash
each row once, and for each column, we take the k lowest-numbered rows in
which that column has a 1 as the signature.

To see why the similarity of these signatures is almost the same as the similarity
of the columns from which they are derived, examine Fig. 4. This figure
represents the signatures $Sig_1$ and $Sig_2$ for columns $C_1$ and $C_2$, respectively, as
if the rows were permuted in the order of their hash values, and rows of type d
(neither column has 1) are omitted. Thus, we see only rows of types a, b, and c,
and we indicate that a row is in the signature by a 1.



[Figure: the columns Sig1 and Sig2 drawn as 0/1 vectors over the rows of types
a, b, and c, in hash order; each column is marked as containing 100 1’s.]

Fig. 4. Example of the signatures of two columns using k-min hashing

Let us assume c > b, so the typical situation (assuming k = 100) is as shown
in Fig. 4: the top 100 rows in the first column include some rows that are not
among the top 100 rows for the second column. Then an estimate of the similarity
of $Sig_1$ and $Sig_2$ can be computed as follows:

$$|Sig_1 \cap Sig_2| = \frac{100a}{a + c}$$

because on average, the fraction of the 100 top rows of $C_2$ that are also rows of
$C_1$ is $a/(a + c)$. Also:

$$|Sig_1 \cup Sig_2| = 100 + \frac{100c}{a + c}$$

The argument is that all 100 rows of $Sig_1$ are in the union. In addition, those
rows of $Sig_2$ that are not rows of $Sig_1$ are in the union, and the latter set of
rows is on average $100c/(a + c)$ rows. Thus, the similarity of $Sig_1$ and $Sig_2$ is:

$$\frac{|Sig_1 \cap Sig_2|}{|Sig_1 \cup Sig_2|} = \frac{\frac{100a}{a + c}}{100 + \frac{100c}{a + c}} = \frac{a}{a + 2c}$$

Note that if c is close to b, then the similarity of the signatures is close to the
similarity of the columns, which is $a/(a + b + c)$. In fact, if the columns are very
similar, then b and c are both small compared to a, and the similarities of the
signatures and columns must be close.
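A minimal sketch of k-min hashing follows, together with the estimate derived above. The single row hash `row_hash` (CRC-32 of the row number), the function names, and the representation of a column as a collection of row numbers are assumptions made for illustration.

```python
import zlib


def row_hash(r):
    """The single hash applied to each row number (an illustrative choice)."""
    return zlib.crc32(str(r).encode())


def kmin_signature(column_rows, k):
    """The k rows of the column that come first when rows are ordered by hash value."""
    return set(sorted(column_rows, key=row_hash)[:k])


def kmin_similarity(sig1, sig2):
    """|Sig1 intersect Sig2| / |Sig1 union Sig2|, the similarity used in the text."""
    return len(sig1 & sig2) / len(sig1 | sig2)


def predicted_similarity(a, c):
    """The estimate a / (a + 2c) derived above, for the case c > b."""
    return a / (a + 2 * c)


if __name__ == "__main__":
    col1 = range(0, 1000)     # rows in which column 1 has a 1
    col2 = range(100, 1100)   # rows in which column 2 has a 1
    s1, s2 = kmin_signature(col1, 100), kmin_signature(col2, 100)
    print(kmin_similarity(s1, s2))
```

Only one hash evaluation per row is needed, which is the saving over ordinary min hashing; the price is that the signature similarity approximates, rather than equals in expectation, the column similarity.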



2.7 Amplification of 1’s (Hamming LSH)

If columns are not sparse, but have about 50% 1’s, then we don’t need min-hashing;
a random collection of rows serves as a signature. Hamming LSH constructs
a series of matrices, each with half as many rows as the previous, by
OR-ing together two consecutive rows from the previous, as in Fig. 5.

— There are about $\log_2 n$ matrices if n is the number of rows. The total
number of rows in all matrices is at most 2n, and they can all be computed with one
pass through the original matrix, storing the large ones on disk.

— In each matrix, produce as candidate pairs those columns that:

1. Have a medium density of 1’s, say between 20% and 80%, and

2. Are likely to be similar, based on an LSH test.

— Note that the density range 20-80% guarantees that any two columns that
are at least 50% similar will be considered together in at least one matrix,
unless by bad luck their relative densities change due to the OR operation
combining two 1’s into one.

— A second pass through the original data confirms which of the candidates
are really similar.

Fig. 5. Construction of a series of exponentially smaller, denser matrices

— This method exploits an idea that can be useful elsewhere: similar columns
have similar numbers of 1’s, so there is no point ever comparing columns
whose numbers of 1’s are very different.
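The construction of the series of smaller, denser matrices can be sketched as follows. The function names are hypothetical, matrices are plain lists of 0/1 rows, and carrying an unpaired last row over unchanged is an assumption of this sketch for odd row counts.

```python
def or_fold(matrix):
    """Halve the number of rows by OR-ing each pair of consecutive rows."""
    folded = []
    for i in range(0, len(matrix) - 1, 2):
        folded.append([x | y for x, y in zip(matrix[i], matrix[i + 1])])
    if len(matrix) % 2 == 1:
        folded.append(list(matrix[-1]))  # unpaired last row is kept as is
    return folded


def matrix_series(matrix):
    """The original matrix followed by its successive OR-folded versions."""
    series = [matrix]
    while len(series[-1]) > 1:
        series.append(or_fold(series[-1]))
    return series


def density(matrix, col):
    """Fraction of rows of `matrix` in which column `col` has a 1."""
    return sum(row[col] for row in matrix) / len(matrix)


if __name__ == "__main__":
    m = [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0]]
    for level in matrix_series(m):
        # Keep a column as a candidate at this level only if its density is medium.
        print(len(level), [density(level, c) for c in range(2)])
```

In the toy matrix each column starts at density 1/8 and doubles at every level, so it enters the 20-80% range at some level, where it would then be handed to the LSH test.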
Abstract. Consider a doctor with a knowledge base KB consisting of
first-order information (such as “All patients with hepatitis have
jaundice”), statistical information (such as “80% of patients with jaundice
have hepatitis”), and default information (such as “patients with pneumonia
typically have fever”).
The doctor may want to make decisions regarding a particular patient, 
using the KB in some principled way. To do this, it is often useful for the 
doctor to assign a numerical “degree of belief” to measure the strength 
of her belief in a given statement A. I focus on one principled method 
for doing so. The method, called the random worlds method, is a natural
one: For any given domain size N, we can look at the proportion of
models satisfying A among models of size N satisfying KB. If we don’t 
know the domain size N, but know that it is large, we can approximate 
the degree of belief in A given KB by taking the limit of this fraction as 
N goes to infinity. 

In many cases that arise in practice, the answers we get using this method 
can be shown to match heuristic assumptions made in many standard 
AI systems. I also show that when the language is restricted to unary 
predicates (for example, symptoms and diseases, but not relations such
as “Taller than”), the answer provided by the random worlds method
can often be computed using maximum entropy. On the other hand, if 
the language includes binary predicates, all connections to maximum 
entropy seem to disappear. Moreover, almost all the questions one might 
want to ask can be shown to be highly undecidable. 

I conclude with some general discussion of the problem of finding reasonable
methods to do inductive reasoning of the sort considered here, and
the relevance of these ideas to data mining and knowledge discovery. 
The talk covers joint work with Fahiem Bacchus, Adam Grove and 
Daphne Koller [1,2]. 
Abstract. Deduction is usually considered to be the opposite of induction.
However, deduction and induction can be related in many ways. In
this paper, two endeavors that try to relate discovery science and verification
technology are described. The first is discovery by deduction,
where attempts to find algorithms are made using verifiers. Case studies
of finding algorithms for concurrent garbage collection and for mutual
exclusion without semaphores are described. Superoptimization can also
be classified as work in this field. Recent work on finding authentication
protocols using a protocol verifier is also briefly surveyed.

The second endeavor is discovery for deduction. This concerns the
long-standing problem of finding induction formulae or loop invariants.
The problem is regarded as one of learning from positive data, and the
notion of safe generalization, which is commonly recognized in learning
from positive data, is introduced into iterative computation of loop invariants.
The similarity between the widening operator in abstract interpretation
and Gold’s notion of identification in the limit is also discussed.



1 Introduction 

Deduction is usually considered the opposite of induction. For example, inductive 
inference of a function means to guess the definition (program) of the function 
from examples of inputs and outputs. On the other hand, deductive inference 
means to derive a theorem by applying inference rules to axioms. However,
deduction and induction can be related in many ways. In this paper, we describe
two endeavors that try to relate discovery science and verification technology. 

An older endeavor, the deductive approach to program synthesis [20], aimed at
synthesizing programs under a deductive framework. This technique is also
known as constructive programming [24,13,16]. In this approach, the specification
of a program that is being synthesized is given as a formula using formal
logic. The formula takes the form $\forall x \exists y. P(x, y)$, where x denotes an input and
y an output. The predicate P(x, y) is called the input-output relation, and specifies
the condition that the output y should satisfy with respect to the input x.
The formula itself is called the existence theorem.

In the deductive approach, the existence theorem is proved constructively. It
is not necessary for the underlying logic to be constructive, but if the theorem is
proved under constructive logic, the proof of $\forall x \exists y. P(x, y)$ is guaranteed to be
constructive. From a constructive proof of $\forall x \exists y. P(x, y)$, it is possible to extract
a function, f, that satisfies $\forall x. P(x, f(x))$. Since the function is represented as a
term in formal logic, it can be regarded as a program if the symbols in the term
are interpreted appropriately. A program is thus obtained from its specification.

This approach, although it elegantly explicates the relationship between pro- 
grams and proofs, is practically infeasible, unless a very powerful theorem prover 
for the underlying logic is available. Since such a prover does not exist in general, 
human assistance is required to prove the existence theorem. If human assistance 
is required at all, then it is not a machine but a human who finds a program be- 
cause it becomes almost obvious for a human to conceive the underlying program 
while he or she writes the proof. This approach, therefore, cannot be considered 
as a method for machine discovery of programs, but as a method for human 
discovery. Proof animation, recently advocated by Hayashi, is more explicit in this 
direction [17]. 

In this paper, we describe another approach to synthesizing programs or 
algorithms under a deductive framework. Model checking is a methodology for 
automatically verifying a program by exploring its state space [6]. It has been 
established as a method for automatically verifying hardware, protocols, and 
software (not concrete programs, but algorithms or designs at abstract levels). 
Automatic verifiers employed in model checking are called model checkers. They 
explore the state space constructed from a given program and its specification 
expressed using temporal logic. 

In order to synthesize a program using a model checker, a space of programs 
must first be defined to which the target program is expected to belong. There 
are many ways to define such a space. It can be defined by a set of parameters. 
A program belonging to the space is then defined as a tuple of the values of 
the parameters. Programs can also be written in a programming language. In 
this case, the space consists of programs expressed in the language. Since such a 
space is infinite in general, it is usually necessary to impose restrictions on the 
size of programs. 

An automatic verifier, i.e., model checker, is then invoked on each program 
in the program space to try to verify the specification that the program should 
satisfy. The verifier determines, within a finite time, whether the specification 
is satisfied or not. Therefore, it is possible to search for a program that satisfies 
the specification by simply searching through the program space. 
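
The search described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the program space is a set of parameter tuples `(a, b)` standing for the program `y = a*x + b`, and the "verifier" is an exhaustive check over a small finite input domain rather than a real model checker.

```python
from itertools import product

def verify(program, inputs, spec):
    """Exhaustive check on a finite domain; stands in for a real verifier."""
    a, b = program
    return all(spec(x, a * x + b) for x in inputs)

def search(space, inputs, spec):
    """Return the first program in the space that the verifier accepts."""
    for program in space:
        if verify(program, inputs, spec):
            return program
    return None

# Program space: all parameter tuples (a, b) with small integer values.
space = list(product(range(-3, 4), repeat=2))

# Specification: the input-output relation P(x, y) the program must satisfy.
def spec(x, y):
    return y == 2 * x + 1

found = search(space, range(10), spec)
print(found)  # (2, 1)
```

Because the verifier always terminates, the outer loop is a plain enumeration; the cost is dominated by the size of the space, which is why restricting program size matters.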

In this paper, this approach is used to find algorithms for concurrent garbage 
collection and for mutual exclusion without semaphores. These case studies are 
described in detail in the next section. 

Superoptimization, which is a method to automatically synthesize the code 
generation table of a compiler, can also be classified as an attempt in this research 
direction [21, 12, 10]. It searches for a sequence of machine code that corresponds 
to an operation in an intermediate language. 

In this paper, Perrig and Song’s recent work on finding authentication proto- 
cols using a protocol verifier is also briefly surveyed [23] . It is an attempt to syn- 
thesize an authentication protocol using a protocol verifier, called Athena [25]. 
This work is important for two reasons. One is that authentication protocols 
are very short in general. Since each protocol consists of only a few message 
exchanges, the protocol space to explore is of a feasible size. The other is that 
the correctness of an authentication protocol is very subtle and requires ma- 
chine verification. Since it is not easy for a human to check the correctness of a 
protocol, it is also difficult to find a correct one. 

The second part of this paper describes the relationship between inductive 
theorem proving and learning from positive data. Inductive theorem proving is to 
automatically prove a theorem by mathematical induction. In the field of 
automated deduction, this is the long-standing problem of finding induction formulae 
or loop invariants. In order to automate mathematical induction, it is necessary 
to automatically synthesize appropriate induction formulae employed in math- 
ematical induction. In particular, the correctness of a loop program requires a 
formula, called the loop invariant, which always holds whenever the loop body 
is executed. The correctness is proved by mathematical induction using the loop 
invariant as the induction formula. 

If we execute a loop program with concrete input, we obtain a sequence of 
concrete states that the program takes at the beginning of the loop body. If we 
can obtain a general representation of the concrete states, then this representa- 
tion can be used as the loop invariant. This problem can be considered as that 
of learning from positive data, because we only have concrete states that are 
positive examples of the general representation. 

In learning from positive data, it is crucial to avoid over-generalization. Many 
generalization procedures have been proposed that avoid over-generalization and 
run in polynomial time. In this paper, the theory of learning from positive data 
is applied, and the notion of safe generalization is introduced into the iterative 
computation of a loop invariant. This means that iterative computation always 
avoids over-generalization, and if it ever converges, it yields the exact loop in- 
variant. 

If the domain is infinite, however, iterative computation does not always 
converge in a finite number of steps. In the field of abstract interpretation, the 
technique called widening is used to guarantee finite convergence. Therefore, the 
widening operator in abstract interpretation and Gold’s notion of identification 
in the limit show close similarity. 
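
As a hedged illustration of why widening forces convergence, the sketch below analyzes the loop `x = 0; while true: x = x + 1` over intervals. The interval domain and the particular widening operator (jump an unstable bound to infinity) are the textbook choices, assumed here for illustration; they are not taken from this paper.

```python
import math

def join(a, b):
    """Smallest interval containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    """Standard interval widening: any bound that moved jumps to infinity."""
    lo = old[0] if old[0] <= new[0] else -math.inf
    hi = old[1] if old[1] >= new[1] else math.inf
    return (lo, hi)

def step(iv):
    """Abstract effect of the loop body `x = x + 1` on an interval."""
    return (iv[0] + 1, iv[1] + 1)

def loop_invariant(init, use_widening, max_iter=1000):
    """Iterate until the interval is stable; return (invariant, iterations)."""
    inv = init
    for n in range(max_iter):
        nxt = join(inv, step(inv))
        if use_widening:
            nxt = widen(inv, nxt)
        if nxt == inv:
            return inv, n
        inv = nxt
    return None, max_iter  # no convergence within the iteration budget

print(loop_invariant((0, 0), use_widening=True))   # ((0, inf), 1)
print(loop_invariant((0, 0), use_widening=False))  # (None, 1000)
```

Without widening the hypothesis keeps changing forever, much like a learner that never stabilizes; with widening the sequence of hypotheses becomes stationary after finitely many steps, at the price of a possibly coarser result.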

2 Discovery by Deduction — Searching for Algorithms 
by Automatic Verifiers 

Model checking is a method used to exhaustively search through a state space to 
verify that error or starvation states are not reachable from an initial state [6]. 
A brief explanation of model checking follows. 

Any computer system, whether it is hardware or software, can be modeled 
as a state transition system, which consists of a set of states and the transition 
relation between those states. A state in this context means a state that the entire 
computer system takes at a certain point of time. If a system is comprised of 
multiple processes, a state of the entire system is a tuple of states, each taken by 
an individual process. The transition relation is a binary relation between states 
that determines, for each state, which states can follow it. A state transition 
system, therefore, comprises a directed graph whose nodes are states and whose 
edges represent the transition relation. 

There are two kinds of properties that should be verified with respect to a 
state transition system: safety and liveness properties. To verify a safety property 
is to repeat state transitions from an initial state and ensure that error states 
are not reachable. This is reduced to a reachability problem in a directed graph. 
To verify a liveness property is to check whether starvation of a certain kind may 
occur under some fairness conditions. If a system is comprised of two processes, 
fairness roughly means that both processes are executed equally often, i.e., they 
are scheduled with an equal chance. There are many kinds of fairness conditions, 
though, and care must be taken as to what kind of fairness should be assumed 
on a particular system. Starvation means that a process is blocked infinitely, to 
wait for some resources to be released, under the assumption that fairness holds. 
If starvation never occurs, the system is said to satisfy liveness. 

The detection of starvation is more difficult than that of error states, because 
starvation is not a property of a single state. A search must be made for an 
infinite execution path that satisfies both starvation and fairness. If the state 
space is finite, this problem is reduced to that of finding a loop that satisfies 
certain properties in a directed graph. 
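
Both reductions can be sketched on an explicit graph. The code below is a simplified illustration, not the verifier used in this paper: safety is a breadth-first reachability check, and the starvation check only looks for a reachable cycle that stays inside a set of "waiting" states, ignoring fairness altogether. The five-state toy system is made up for the example.

```python
from collections import deque

def reachable(init, succ):
    """All states reachable from init by repeated transitions."""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        for t in succ(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def is_safe(init, succ, is_error):
    """Safety: no error state is reachable."""
    return not any(is_error(s) for s in reachable(init, succ))

def has_cycle_within(init, succ, waiting):
    """True if some reachable cycle stays entirely inside waiting states."""
    color = {}  # 1 = on the current DFS path, 2 = finished

    def dfs(s):
        color[s] = 1
        for t in succ(s):
            if not waiting(t):
                continue
            if color.get(t) == 1:
                return True
            if t not in color and dfs(t):
                return True
        color[s] = 2
        return False

    starts = [s for s in reachable(init, succ) if waiting(s)]
    return any(s not in color and dfs(s) for s in starts)

# Toy transition relation (an assumption for illustration only).
edges = {0: [1], 1: [2], 2: [1, 3], 3: [0], 4: [4]}

def succ(s):
    return edges.get(s, [])

print(is_safe(0, succ, lambda s: s == 4))             # True: 4 is unreachable
print(has_cycle_within(0, succ, lambda s: s in {1, 2}))  # True: 1 -> 2 -> 1
```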

In this paper, we are interested in using verifiers to find algorithms. We define 
a space of algorithms, and search through it to find an algorithm that satisfies 
safety and/or liveness. 

In the rest of this section, attempts made to find algorithms in the following 
fields are described: 

— concurrent garbage collection, 

— mutual exclusion without semaphores, 

— superoptimization, 

— authentication protocols. 



2.1 Concurrent Garbage Collection 

The first case study using model checking is finding algorithms for concurrent 
garbage collection. Such an algorithm is comprised of two processes: a collector 
that collects garbage cells, and a mutator that manipulates cell pointers and does 
some computations. Algorithms such as on-the-fly [9] and snapshot [27,28] are 
well known for concurrent garbage collection. Although these two algorithms are 
based on completely different ideas, they can be modeled in a uniform framework. 
For example, both employ four cell colors: white, black, gray, and free. The color 
“free” means that a cell is free, i.e., allocatable. 

In our framework, the mutator refers to some registers, each holding a pointer 
to a cell. The collector also refers to the registers when it marks the cells in use. 
Since the collector begins marking with the pointers held in the registers, a 
register is also called a root. 

In this study, we assume that each cell has only one field to simplify the 
framework. (Fortunately, this restriction turned out to be irrelevant for the al- 
gorithms discovered.) Figure 1 shows registers and heaps. 




Fig. 1. Registers and cells. 



The collector takes the following four steps. 

— shade: The collector makes all cells that are directly reachable from a register 
gray. This step is executed with the mutator stopped. 

— mark: The collector selects a gray cell and makes it black. If it refers to 
another cell which is white, then the white cell is made gray. If there is no gray 
cell, the collector goes to the next step. This step is executed concurrently 
with the mutator. 

— append: The collector selects a white cell and makes it free. If there is no 
white cell, the collector goes to the next step. This step is executed concur- 
rently with the mutator. 

— unmark: The collector selects a black or gray cell and makes it white. If there 
is no black or gray cell, the collector goes back to the first step, shade. This 
step is executed concurrently with the mutator. 

The mutators used in the on-the-fly and snapshot algorithms have different 
operations. In this case study, thirteen operations are defined that cover both 
algorithms. In the following description, R[i] denotes the contents of the i-th 
register, and F[i] denotes the contents of the field of the i-th cell. Indices of 
cells begin with 1, and the index 0 denotes nil (the null pointer). The procedure 
white_to_gray(i) makes the i-th cell gray if it is white and i is not 0. 

(0) Allocate a new cell in the shade step. 

(1) Allocate a new cell in the mark step. 

(2) Allocate a new cell in the append step. 

(3) Allocate a new cell in the unmark step. 

(4) Make a newly allocated cell gray. 

(5) Make a newly allocated cell black. 

(6) white_to_gray(F[R[i]]); F[R[i]] := 0; 

(7) F[R[i]] := 0; 

(8) R[i] := F[R[j]]; white_to_gray(R[i]); 

(9) R[i] := F[R[j]]; 

(10) F[R[i]] := R[j]; white_to_gray(F[R[i]]); 

(11) F[R[i]] := R[j]; 

(12) white_to_gray(F[R[i]]); F[R[i]] := R[j]; 

Allocation of a cell is accomplished by a combination of one of (0), (1), (2) or 
(3), and one of (4) or (5). Each of (0), (1), (2) and (3) corresponds to a collector 
step. 

An error state occurs when a cell that is reachable from a register (root) 
becomes free. If error states are not reachable from an initial state, an algorithm 
for concurrent garbage collection is said to be safe. Although liveness is also re- 
quired for concurrent garbage collection, we search for algorithms only according 
to the safety property. 

In this case study, we searched for algorithms with respect to a finite model. 
The model consists of only three cells and three registers. A state of the model 
can therefore be represented by 20 bits. The only initial state is that in which 
all the registers hold the 0 value, all the cells are free, and the collector is in the 
shade step. 

We defined the algorithm space according to whether or not each of thirteen 
operations is allowed. The number of algorithms in the space is therefore $2^{13}$. 
We applied finite model checking to each algorithm and computed the maximal 
algorithms that satisfy the safety property, i.e., do not reach an error state. Al- 
gorithms are maximal if they allow as many operations as possible. The verifier, 
i.e., model checker, was implemented in C. 
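
The computation of maximal algorithms can be sketched as follows. The sketch assumes that safety is preserved when operations are removed (fewer operations mean fewer transitions, hence fewer reachable states), so a safe algorithm is maximal exactly when enabling any single further operation makes it unsafe. The predicate `toy_is_safe` is a hypothetical stand-in for the model checker, and bit `i` of a mask stands for operation `(i)`, which is a different bit order from the strings listed below.

```python
def maximal_safe(n_ops, is_safe):
    """Enumerate all 2**n_ops operation sets and keep the maximal safe ones."""
    safe = {m for m in range(1 << n_ops) if is_safe(m)}
    maximal = []
    for m in sorted(safe):
        disabled = [i for i in range(n_ops) if not m & (1 << i)]
        # Maximal: enabling any one disabled operation leaves the safe set.
        if all((m | (1 << i)) not in safe for i in disabled):
            maximal.append(m)
    return maximal

def toy_is_safe(mask):
    """Hypothetical verifier: operations 0 and 1 conflict, and so do 2 and 3."""
    return (mask & 0b0011) != 0b0011 and (mask & 0b1100) != 0b1100

result = maximal_safe(4, toy_is_safe)
print([format(m, "04b") for m in result])  # ['0101', '0110', '1001', '1010']
```

With thirteen operations the enumeration covers 8192 candidates, so running a finite-state verifier on each of them is feasible.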

The following maximal algorithms were identified: 



- 0000111111111   no allocation 
- 1111001111111   no allocation 
- 0111111110111   a variant of on-the-fly 
- 1111101110111   on-the-fly 
- 0111111011001   snapshot 
- 1111101011001   a variant of snapshot 
- 1111111110100   a variant of on-the-fly 
- 1111111111000   no field update 



Each algorithm is represented by thirteen bits corresponding to the thirteen op- 
erations, where 0 indicates that the corresponding operation is inhibited, and 
1 indicates that it is allowed. Among these algorithms, the first two do not 
permit allocation of new cells. The last one does not permit the fields to be 
updated. Therefore, only the following five algorithms are meaningful for con- 
current garbage collection: 



- on-the-fly and its two variants. 
- snapshot and its one variant. 

There are no other (correct) algorithms that consist of the above operations. 
Before the experiment, it was expected that there might be algorithms that were 
different from on-the-fly and snapshot, but the discovered algorithms can all be 
classified into one of these two kinds. Liveness of concurrent garbage collection 
means that garbage cells (those cells that are not reachable from a register) are 
eventually collected (and made free). The algorithms identified above all satisfy 
the liveness property. Therefore, in this case study, the safety property was 
sufficient to permit discovery of the appropriate algorithms. Furthermore, the 
correctness of the discovered algorithms was verified independently of the number 
of cells or registers, using the technique of abstraction [26]. 

2.2 Mutual Exclusion without Semaphores 

This case study looks for variants of Dekker’s algorithm for mutual exclusion 
among processes. In the previous study, the algorithm space was defined by a 
fixed set of operations. In other words, the space was comprised of tuples of 
thirteen boolean values. In this case study, we define the space of programs 
consisting of pseudo-instructions. 

Dekker’s algorithm realizes mutual exclusion among processes without 
semaphores. Figure 2 shows an instance of the algorithm for two processes. It 
can be generalized to an arbitrary number of processes, but we consider only 
two processes in this paper. In the figure, me denotes the number of the process 
that is executing the code (1 or 2), and you denotes the number of the other 
process (2 or 1). The entry part realizes mutual exclusion before entering the 
critical section, and the finishing part is executed after the critical section. The 
idle part represents a process-dependent task. 

The safety property of Dekker’s algorithm is: 

Two processes do not simultaneously enter the critical section. 

Liveness is: 

There does not exist an execution path (loop) that begins and ends with 

the same state, at which one process is in its entry part, and satisfies the 

following conditions. 

— The process stays in the entry part on the execution path, i.e., it 
does not enter the critical section. 

— Both processes execute at least one instruction on the execution path. 

In this case study, liveness was also checked during the search, using the 
nested depth-first search. This is a technique developed for SPIN, one of the 
more popular model checkers [18]. The verifier, as well as the search problem 
discussed later, was implemented in Java. 
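
The safety half of this check is small enough to sketch. The code below is not the verifier used in this study (which was written in Java and also checked liveness); it is an illustrative Python model of Dekker's algorithm for two processes numbered 0 and 1. The program-counter numbering and the assumption that every test and every assignment is one atomic step are choices made for this sketch.

```python
from collections import deque

CRITICAL = 6  # program counter value of the critical section

def step(state, me):
    """Execute one atomic instruction of process `me` (0 or 1)."""
    pcs, flags, turn = list(state[0]), list(state[1]), state[2]
    you, pc = 1 - me, state[0][me]
    if pc == 0:                      # flags[me] = true
        flags[me], pc = True, 1
    elif pc == 1:                    # while (flags[you]) ...
        pc = 2 if flags[you] else CRITICAL
    elif pc == 2:                    # if (turn != me) ...
        pc = 3 if turn != me else 1
    elif pc == 3:                    # flags[me] = false
        flags[me], pc = False, 4
    elif pc == 4:                    # while (turn != me) {}
        pc = 4 if turn != me else 5
    elif pc == 5:                    # flags[me] = true
        flags[me], pc = True, 1
    elif pc == 6:                    # leave the critical section
        pc = 7
    elif pc == 7:                    # turn = you
        turn, pc = you, 8
    elif pc == 8:                    # flags[me] = false, then start over
        flags[me], pc = False, 0
    pcs[me] = pc
    return (tuple(pcs), tuple(flags), turn)

def broken_step(state, me):
    """Variant that omits the initial `flags[me] = true` (expected unsafe)."""
    pcs, flags, turn = state
    if pcs[me] == 0:
        new_pcs = list(pcs)
        new_pcs[me] = 1
        return (tuple(new_pcs), flags, turn)
    return step(state, me)

def mutually_exclusive(step_fn):
    """Explore all interleavings; False if both processes can be critical."""
    init = ((0, 0), (False, False), 0)
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if s[0] == (CRITICAL, CRITICAL):
            return False
        for me in (0, 1):
            t = step_fn(s, me)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True

print(mutually_exclusive(step))         # True
print(mutually_exclusive(broken_step))  # False
```

A search over variants, as in this case study, would generate many candidate `step` functions from pseudo-code and keep those that pass both this check and the liveness check.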

We represent Dekker’s algorithm using pseudo-code consisting of pseudo- 
instructions. The pseudo-code may refer to three variables, each of which holds 
a boolean value. 
for (;;) { 
    // beginning of the entry part 
    flags[me] = true; 
    while (flags[you] == true) { 
        if (turn != me) { 
            flags[me] = false; 
            while (turn != me) {} 
            flags[me] = true; 
        } 
    } // end of the entry part 

    // the critical section 

    // beginning of the finishing part 
    turn = you; 
    flags[me] = false; 
    // end of the finishing part 

    // the idle part 
} 



Fig. 2. Dekker’s algorithm. 



- FLAG1: This variable corresponds to flags[1] in Figure 2. 

- FLAG2: This variable corresponds to flags[2] in Figure 2. 

- TURN: This variable corresponds to turn in Figure 2. TURN=true means 
  turn=1, and TURN=false means turn=2. 

The set of pseudo-instructions is: 

- SET variable 

- CLEAR variable 

- IF variable { instructions } 

- IF_NOT variable { instructions } 

- WHILE variable { instructions } 

- WHILE_NOT variable { instructions } 

The entry part of process 1 is represented by the following pseudo-code. 

SET FLAG1 
WHILE FLAG2 { 
    IF_NOT TURN { 
        CLEAR FLAG1 
        WHILE_NOT TURN {} 
        SET FLAG1 
    } 
} 







In this case study, we searched for variants of the entry part of Dekker’s algorithm 
that satisfy both safety and liveness. The entry parts of the two processes 
were assumed to be symmetric in the following sense. If one refers to FLAG1 (or 
FLAG2), then the other refers to FLAG2 (or FLAG1). If one contains an instruction 
that refers to TURN, then the other contains the corresponding symmetric 
instruction, where IF and IF_NOT, WHILE and WHILE_NOT, and SET and CLEAR are 
symmetric to each other. 

In the first experiment, we searched for a 5-instruction pseudo-code consisting 
of the following instructions. Note that the original entry part consists of 6 
instructions. 

- WHILE FLAG2 ... 

- IF FLAG2 ... 

- WHILE_NOT TURN ... 

- IF_NOT TURN ... 

- SET FLAG1 

- CLEAR FLAG1 

We discovered the code shown in Figure 3(a). This code, however, is equivalent 
to the original code of Dekker’s algorithm, in the sense that they have the same 
meaning as sequential programs. We also found more than ten variants similar 
to the above, all equivalent to the original code as sequential programs. 

In the next experiment, we imposed the following restriction. 

If both processes are in their entry or finishing parts, they run with 
the same speed, i.e., they both execute one instruction in a single state 
transition of the entire system. If they read or write to the same variable 
simultaneously, it is nondeterministic which process wins (runs first). If 
both processes are not in their entry or finishing parts, one process is 
chosen nondeterministically and allowed to execute one instruction. 

Under this restriction, we discovered the code in Figure 3(b), which consists of 
4 instructions. This is the only 4-instruction code that we discovered. We also 
searched for a 3-instruction code, but failed. 



SET FLAG1 
WHILE FLAG2 { 
    WHILE_NOT TURN { 
        CLEAR FLAG1 
    } 
    SET FLAG1 
} 

(a) 

WHILE FLAG2 {} 
SET FLAG1 
IF_NOT TURN { 
    WHILE FLAG2 {} 
} 

(b) 



Fig. 3. Generated pseudo-code. 
The correctness of the discovered code is not obvious. We believe that finding 
algorithms by verifiers is effective in situations where complex conditions are 
imposed that are difficult for humans to grasp. 



2.3 Superoptimization 

Superoptimization [21,12,10] is the most practical and successful work in this 
field. It is a technique used to synthesize the code generation table of a compiler. 
Given the specification of a code sequence, corresponding to some operation in 
an intermediate language, it searches for sequences of machine code that satisfy 
that specification. 

Massalin initiated this technique, later made popular by Granlund and Ken- 
ner, who used it to synthesize a code generation table for the GCC compiler. 
The code in Figure 4(a) was found by Massalin’s superoptimizer for Motorola’s 
68020. This is the code for the following function (written in C): 

int signum(int x) { 

if (x>0) return 1; 
else if (x<0) return -1; 
else return 0; 

} 

Recently, Mizukami repeated Massalin’s work for Sparc, and obtained the code 
in Figure 4(b) [22]. 



(x in d0) 
add.l  d0,d0 
subx.l d1,d1 
negx.l d0 
addx.l d1,d1 
(signum(x) in d1) 

(a) Motorola 68000 

(x in %o0) 
addcc  %o0, %o0, %l0 
subxcc %o0, %l0, %l0 
addxcc %o0, %l0, %o0 
(signum(x) in %o0) 

(b) Sparc 



Fig. 4. Code generated by superoptimizers. 



Superoptimization does not employ a real verifier; it only checks the correct- 
ness of the discovered code using some random numbers. This is sufficient in 
practice, since a human has the final decision as to whether to incorporate the 
code sequences obtained into the code generation table or not. In this final step, 
the correctness of the generated code can be manually checked. 
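
The random-number check can be illustrated on the 68000-family sequence of Figure 4(a). The simulation below is a sketch under stated assumptions: it models only 32-bit data registers and the extend (X) flag, with `add.l` setting X to the carry out and `subx.l`, `negx.l` setting X to the borrow, which is our reading of the instruction semantics rather than a full emulator. The names `signum_ref`, `signum_candidate`, and `random_check` are introduced here for illustration.

```python
import random

MASK = 0xFFFFFFFF  # 32-bit registers

def signum_ref(x):
    """Reference specification of signum."""
    return (x > 0) - (x < 0)

def signum_candidate(x):
    """Simulate add.l d0,d0; subx.l d1,d1; negx.l d0; addx.l d1,d1."""
    d0, d1 = x & MASK, 0             # the initial value of d1 is irrelevant
    total = d0 + d0                  # add.l d0,d0
    xflag, d0 = total >> 32, total & MASK
    diff = d1 - d1 - xflag           # subx.l d1,d1
    xflag, d1 = int(diff < 0), diff & MASK
    diff = 0 - d0 - xflag            # negx.l d0
    xflag, d0 = int(diff < 0), diff & MASK
    d1 = (d1 + d1 + xflag) & MASK    # addx.l d1,d1
    return d1 - (1 << 32) if d1 & 0x80000000 else d1  # read d1 as signed

def random_check(candidate, reference, trials=10000, seed=0):
    """Compare candidate and reference on edge cases plus random inputs."""
    rng = random.Random(seed)
    samples = [0, 1, -1, 2**31 - 1, -2**31]
    samples += [rng.randrange(-2**31, 2**31) for _ in range(trials)]
    return all(candidate(x) == reference(x) for x in samples)

print(random_check(signum_candidate, signum_ref))  # True
print(random_check(lambda x: 1, signum_ref))       # False
```

Passing such a test does not prove correctness; it only filters out wrong candidates cheaply, which matches the role described above.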

2.4 Authentication Protocols 

Finding algorithms using verifiers is effective in a field where the algorithms 
are very short, but still require substantial effort to verify their correctness. 
Security protocols, such as those used for authentication, are good examples of 
such algorithms. 

Perrig and Song [23] recently used Song’s protocol verifier, Athena [25], to try 
to discover symmetric-key and asymmetric-key mutual authentication protocols 
that satisfy the agreement property. They also defined a metric function that 
measures the cost or overhead of protocols. This function was intended to be 
applied to correct protocols to select the most efficient. The function was also 
used to restrict the set of generated protocols. 

After setting UNIT_ELEMENT_COST=1 (cost to send a nonce or a principal 
name), NEW_NONCE_COST=1 (cost to generate a new nonce), and 
ASYM_ENCRYPTION_COST=3 (cost to encrypt a message with an asymmetric key), 
they succeeded in discovering the following protocol among those whose cost is 
less than or equal to 14: 

Protocol: $A \to B : \{N_A, A\}_{K_B}$ 
          $B \to A : \{N_A, N_B, B\}_{K_A}$ 
          $A \to B : N_B$ 

The protocol coincides with the Needham-Schroeder protocol, except for the last 
message. In the Needham-Schroeder protocol, the last message is $\{N_B\}_{K_B}$. This 
difference occurred because they did not include secrecy in the specification. 
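
To make the role of a cost metric concrete, here is a hypothetical additive accounting built from the three constants quoted above. How Perrig and Song actually combine these costs is not stated in this survey, so the function `message_cost` and the way messages are encoded are assumptions for illustration only.

```python
UNIT_ELEMENT_COST = 1      # cost to send a nonce or a principal name
NEW_NONCE_COST = 1         # cost to generate a new nonce
ASYM_ENCRYPTION_COST = 3   # cost to encrypt with an asymmetric key

def message_cost(elements, new_nonces=0, asym_encrypted=False):
    """Assumed additive cost of one message (hypothetical accounting)."""
    cost = UNIT_ELEMENT_COST * len(elements) + NEW_NONCE_COST * new_nonces
    if asym_encrypted:
        cost += ASYM_ENCRYPTION_COST
    return cost

# The three messages of the discovered protocol:
# (elements sent, nonces freshly generated, asymmetric encryption used).
protocol = [
    (["Na", "A"], 1, True),
    (["Na", "Nb", "B"], 1, True),
    (["Nb"], 0, False),
]

total = sum(message_cost(*message) for message in protocol)
print(total)  # 14 under this assumed accounting
```

Under this accounting the total happens to equal the bound of 14 quoted above, but that coincidence should not be read as confirmation that this is the metric they used. The point is only that a cheap, computable cost lets the generator discard expensive candidates before the verifier is invoked.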

The symmetric mutual authentication protocols they found, shown in Fig- 
ure 5, are more interesting. All these protocols are simpler than any known previ- 
ously. The first protocol was found using a metric function whose 
UNIT_ELEMENT_COST is high, while the last two were found using a metric function 
whose SYM_ENCRYPTION_COST is high. 



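(1) $A \to B : \{N_A, A\}_{K_{AB}}$ 
    $B \to A : \{N_A, N_B\}_{K_{AB}}$ 
    $A \to B : N_B$ 

(2) $A \to B : N_A, A$ 
    $B \to A : \{N_A, N_B, A\}_{K_{AB}}$ 
    $A \to B : N_B$ 

(3) $A \to B : N_A, A$ 
    $B \to A : \{N_A, N_B, B\}_{K_{AB}}$ 
    $A \to B : N_B$ 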



Fig. 5. Protocols generated by the automatic protocol generator. 



Perrig and Song demonstrate the applicability of their method based on 
these results. Their success is due mainly to their efforts to reduce the number 
of protocols by imposing reasonable restrictions, and to prune protocols by cheap 
tests before applying the protocol verifier. 

3 Discovery for Deduction — Reachability Computation 
and Learning from Positive Data 

The problem of synthesizing a loop invariant should have a close relationship with 
that of learning from positive data [2,5]. In this section, the problem of char- 
acterizing reachable states of a state transition system is discussed. This is a 
problem of finding a finite representation of the set of states that are reachable 
from an initial state. 

Assume a state transition system. Let $\mathcal{S}$ be the set of all states of the system, 
and $\to$ be the transition relation between states. The function $f$ can be defined 
that maps a subset of $\mathcal{S}$ to a subset of $\mathcal{S}$ as follows: 

$f(S) = \{ s' \in \mathcal{S} \mid s \in S,\ s \to s' \}$ 

This map is called the state transformer. 

Assume a single initial state, $s_0$. Let $S_\omega$ be the set of states that are reachable 
from $s_0$, 

$S_\omega = \bigcup_{i \ge 0} f^i(\{s_0\}) \subseteq \mathcal{S}$ 

The goal is to obtain a finite representation of $S_\omega$ by inductive inference, i.e., to 
infer $S_\omega$ from examples in $S_\omega$. In the following, $S_\omega$ is termed the target language. 

3.1 Safe Generalization 

Let $\mathcal{S}$ be a set of data and $\mathcal{L}_0$ be a language family containing $\emptyset$. In our case, 
$\mathcal{S}$ is the set of all states of the state transition system under consideration. A 
language is a subset of $\mathcal{S}$, and a language family is a set of languages. We assume 
that the target language, $S_\omega \subseteq \mathcal{S}$, belongs to $\mathcal{L}_0$. 

An example of a language family is $\mathcal{TP}^k_\Sigma$, the family of unions of at most 
$k$ tree patterns [3]. In this case, $\mathcal{S}$ is the set $T_\Sigma$ of closed terms composed of 
function and constant symbols taken from $\Sigma$. A tree pattern is a term composed 
of function symbols, constant symbols, and variables. A tree pattern represents 
the set of closed terms that are instances of the tree pattern. If $p_1, \ldots, p_n$ are tree 
patterns, where $n \le k$, then $p_1, \ldots, p_n$ represents 

$L(p_1) \cup \cdots \cup L(p_n) \subseteq \mathcal{S}$, 

where $L(p_i)$ denotes the set of closed instances of $p_i$. 

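Let $\mathcal{L}$ be a language family containing $\mathcal{L}_0$, i.e., $\mathcal{L}_0 \subseteq \mathcal{L}$. For example, we can 
take $\mathcal{TP}^{k'}_\Sigma$ as $\mathcal{L}$ in the previous example, where $\mathcal{L}_0 = \mathcal{TP}^k_\Sigma$. 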

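Let $S'$ be a subset of $S_\omega$, i.e., $S' \subseteq S_\omega$, and $L$ be a language belonging to 
$\mathcal{L}$, i.e., $L \in \mathcal{L}$. We say that $L$ is a safe generalization of $S'$ if the following 
conditions are satisfied: 

- $S' \subseteq L$. 

- For any $L_0 \in \mathcal{L}_0$, if $S' \subseteq L_0$, then $S' \subseteq L \subseteq L_0$. 

Since $S_\omega \in \mathcal{L}_0$, $L \subseteq S_\omega$ is guaranteed by the second condition. 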
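
For a finite language family the two conditions can be checked by brute force. The sketch below represents languages as Python frozensets; the small family `family0` is invented for the example, and membership of the candidate in the larger family is not checked.

```python
def is_safe_generalization(L, S_prime, family0):
    """Check both conditions of the definition against a finite family."""
    if not S_prime <= L:
        return False
    # L must lie inside every language of the family that contains S'.
    return all(L <= L0 for L0 in family0 if S_prime <= L0)

# A made-up family of languages over the data 0..5.
family0 = [
    frozenset(),
    frozenset({0, 1}),
    frozenset({0, 1, 2, 3}),
    frozenset({0, 1, 4, 5}),
]
S_prime = frozenset({0})

print(is_safe_generalization(frozenset({0, 1}), S_prime, family0))        # True
print(is_safe_generalization(frozenset({0, 1, 2, 3}), S_prime, family0))  # False
```

Here `{0, 1}` is safe because every language of the family that contains `0` also contains `1`, whereas `{0, 1, 2, 3}` overshoots the candidate target `{0, 1}`.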

3.2 Reachability Computation by Safe Generalization 

Let $S_0 = \{s_0\}$. Recall that $S_\omega$ was defined as follows: 

$S_\omega = \bigcup_{i \ge 0} f^i(S_0) \subseteq \mathcal{S}$ 

Let $L_0 = \emptyset \in \mathcal{L}$. 

Assume that we have $S_i$ and $L_i$ such that $S_i \subseteq S_\omega$, $L_i \subseteq S_\omega$, and $L_i \in \mathcal{L}$. Let 
$S' = S_i \cup L_i$, and let $L_{i+1}$ be a safe generalization of $S'$. We then define $S_{i+1}$ as 
follows. 

$S_{i+1} = f(L_{i+1})$ 

Since $L_{i+1}$ is a safe generalization of $S'$ and $S' \subseteq S_\omega$, $L_{i+1} \subseteq S_\omega$ holds. Therefore, 
$S_{i+1} \subseteq S_\omega$ also holds. It can be shown that $S_\omega = \bigcup_{i \ge 0} L_i$ and $L_0 \subseteq L_1 \subseteq L_2 \subseteq \cdots$. 

The final question to ask is whether $S_\omega$ is identifiable in the limit [11], i.e., 
does $L_i = S_\omega$ hold after a finite number of steps. This question can be rephrased 
as whether there exists $i$ such that $S_i = f(L_i) \subseteq L_i$. (We later discuss this 
problem in relation to the widening operator in abstract interpretation.) 

Reachability computation using safe generalization explained so far is sum- 
marized in Figure 6. 



Algorithm Reachability; 
  $S := \{s_0\}$; 
  $L := \emptyset$; 
  repeat 
    $L :=$ a safe generalization of $S \cup L$; 
    $S := f(L)$; 
  until $S \subseteq L$; 
  output $L$; 



Fig. 6. Reachability computation using safe generalization. 
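
The algorithm of Figure 6 can be transcribed almost literally. In the sketch below the transition graph and the family `family0` are invented, and two generalization operators are compared: the identity (trivially safe, but it generalizes nothing) and the intersection of all languages of the family that contain the data, which satisfies both conditions of safe generalization provided the larger family is closed under such intersections (an assumption of this sketch).

```python
def reachability(s0, transitions, generalize):
    """Figure 6: returns the computed language and the number of iterations."""
    def f(states):  # the state transformer
        return frozenset(t for s in states for t in transitions.get(s, ()))

    S, L, iterations = frozenset([s0]), frozenset(), 0
    while True:
        L = generalize(S | L)
        S = f(L)
        iterations += 1
        if S <= L:
            return L, iterations

# Made-up transition graph; the states reachable from 0 are {0, 1, 2, 3}.
transitions = {0: [1], 1: [2], 2: [3], 3: [0], 4: [5]}

# Made-up family assumed to contain the target language.
family0 = [frozenset({0, 1, 2, 3}), frozenset(range(6)), frozenset(range(8))]

def by_family(data):
    """Intersection of all family members containing the data."""
    return frozenset.intersection(*[L0 for L0 in family0 if data <= L0])

print(reachability(0, transitions, lambda data: data))  # 4 iterations
print(reachability(0, transitions, by_family))          # 1 iteration
```

Both runs return the exact set of reachable states; the generalizing operator merely reaches it in fewer iterations, which is the benefit one hopes for when the state space is large or infinite.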



Arimura et al. defined a polynomial time algorithm, k-MMG, that computes 
a k-mmg (k-minimal multiple generalization) of a given set S of closed trees [3]. 
Their algorithm returns a minimal language of $\mathcal{TP}^k_\Sigma$ containing S, i.e., a 
set of at most k tree patterns, whose union covers S, and which is minimal 
among such patterns. They incorporated this algorithm into Angluin’s inference 
machine, as shown in Figure 7 [1]. In the figure, minl(S) denotes a k-mmg of S. 

There are several problems with this algorithm. First, it is only defined 
for closed trees. This problem can be easily avoided, though, if variables in 
S are regarded as constants. Second, the algorithm only returns one of the 
k-mmg’s, whereas, in general, there is more than one k-mmg. For example, if 
$S = \{f(a,b), f(a',b), f(a,b')\}$, then both $\{f(a,y), f(a',b)\}$ and $\{f(x,b), f(a,b')\}$ 
are 2-mmg’s. This problem is crucial, because $L_{g_i}$ is not guaranteed to be a subset 
of the target language. Third, the algorithm always looks through S. The 
algorithm is not local and is not incremental. Therefore, k-mmg’s cannot be 
employed for safe generalization. 
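
The non-uniqueness in the example can be checked mechanically. The sketch below uses an encoding chosen for illustration: a term is a nested tuple whose first component is the function symbol, constants are plain strings, and variables are strings starting with `?`. It only tests coverage and does not compute k-mmg's.

```python
def matches(pattern, term, binding=None):
    """True if `term` is an instance of `pattern`."""
    binding = {} if binding is None else binding
    if isinstance(pattern, str):
        if pattern.startswith("?"):          # variable: bind consistently
            return binding.setdefault(pattern, term) == term
        return pattern == term               # constant
    return (isinstance(term, tuple) and len(term) == len(pattern)
            and pattern[0] == term[0]
            and all(matches(p, t, binding)
                    for p, t in zip(pattern[1:], term[1:])))

def covers(patterns, terms):
    """True if every term is an instance of at least one pattern."""
    return all(any(matches(p, t) for p in patterns) for t in terms)

S = [("f", "a", "b"), ("f", "a'", "b"), ("f", "a", "b'")]
G1 = [("f", "a", "?y"), ("f", "a'", "b")]
G2 = [("f", "?x", "b"), ("f", "a", "b'")]

print(covers(G1, S), covers(G2, S))        # True True
print(covers(G1, [("f", "a", "c")]))       # True
print(covers(G2, [("f", "a", "c")]))       # False
```

Both pattern sets cover S, yet they denote different languages: the first contains `f(a, c)` and the second does not. Whichever one the algorithm returns may therefore contain terms outside the target language.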

Theoretically, what we require is strongly monotone learning from text (pos- 
itive data) as described by Lange and Zeugmann [19]. Moreover, for practical 
applications, we believe that updates of hypotheses should be local and incremental. 
procedure M; 
input: an infinite sequence $w_1, w_2, \ldots$ of strings; 
output: an infinite sequence $g_1, g_2, \ldots$ of guesses; 
begin 
  set $g_0$ to be the null index, i.e., $L_{g_0} = \emptyset$, set $S = \emptyset$ and set $i = 0$; 
  repeat 
    read the next example $w_i$ and add it to $S$; 
    if $w_i \notin L_{g_{i-1}}$ then let $g_i$ be minl($S$) else let $g_i$ be $g_{i-1}$; 
    output $g_i$ and let $i$ be $i + 1$; 
  forever; /* main loop */ 
end. 



Fig. 7. Angluin’s inference machine M. 



For our purpose, the framework of learning using queries is more suitable 
than that of simple inductive inference, though we cannot ask queries in general. 
Arimura et al. formulated the procedure in Figure 8, which learns unions of at 
most k tree patterns using subset queries [4]. This framework is considered to 
be an instance of the more general framework of learning from entailment [5]. 
In particular, the generic algorithm for learning from entailment proposed by 
Arimura and Yamamoto shows a close correspondence with our algorithm in 
Figure 6. This is because their algorithm only generates hypotheses that are 
subsets of the target language. 

In their algorithm in Figure 9, $H_*$ denotes the target concept, and EQ(H) 
is called an entailment equivalence query, which asks whether hypothesis H is 
equivalent to $H_*$. The other kind of query, SQ(C), called a subsumption membership 
query, asks whether clause C is subsumed by $H_*$. Generalization of clauses 
C and D is denoted by Gen(C, D). (Since their algorithm is formulated in inductive 
logic programming, the target concept is assumed to be a Horn sentence.) 

Note that their algorithm only generates hypotheses that are subsets of the 
target, and therefore counterexamples returned by EQ are always positive examples. 
The check, $S \subseteq L$, in our algorithm in Figure 6 therefore corresponds to 
EQ. 

As for SQ, by employing safe generalization, our algorithm uses only generalizations 
that pass SQ. In other words, safe generalization makes SQ unnecessary. 
In the next section, we give an example of safe generalization that is local and 
incremental. 



3.3 Example of Safe Generalization — Depth-Bounded Tree 
Patterns 

An example of safe generalization is presented in this section. Let $\mathcal{L}_0$ be the 
family of unions of patterns whose maximum depth is at most D. Since $\mathcal{L}_0$ is of 
a finite cardinality, it is theoretically inferrable from positive data [2]. Let $\lceil p \rceil$ 
be the maximum depth of tree pattern p. It is formally defined as follows: 



$\lceil x \rceil = 0$ ($x$: variable) 
Procedure: LEARN 
Given: the equivalence and the subset oracles for the target set $H_* \in \mathcal{TP}^k$. 
Output: a set $H$ of at most $k$ tree patterns equivalent to $H_*$. 
begin 
  $H := \emptyset$; 
  until EQUIV($H$) returns “yes” do 
  begin 
    let $w$ be a counterexample returned by the equivalence query; 
    if there is some $h \in H$ such that SUBSET($h \cup w$) returns “yes” then 
      generalize $H$ by replacing $h$ with $h \cup w$ 
    else if $|H| < k$ then 
      generalize $H$ by adding $w$ into $H$ 
    else 
      return “failed” 
    endif 
  end /* main loop */ 
  return $H$; 
end 

Fig. 8. A learning algorithm for $\mathcal{TP}^k$ using equivalence and restricted subset queries. 



Algorithm EntLearn; 

H := 0; 

while EQ{H) returns “no” do begin /* H, ^ H */ 
Let F be a counterexample returned by the query; 
D := Missing{E, H)-, 

foreach C € H do make query SQ{Gen{C, D))-, 
if the query returned “yes” for some C € H then 
H := {H-{C}) U {Gen{C,D)} 
else 

H := HU{E}-, 
end /* while */ 
return iL; 



Fig. 9. A generic algorithm for learning from entailment. 
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[c] = 1 (c : constant) 

- ■ ■ ,Pn)^ = max([pi],- • • , [p„]) + 1 

Let us write $L(p)$ for the set of instances of tree pattern $p$. $L(p)$
contains not only closed terms, but also tree patterns that are instances of
$p$. Then $L_0$ can be formally defined as follows.

$$L_0 = \{\, L(p_1) \cup \cdots \cup L(p_n) \mid \lceil p_i \rceil \le D \,\}$$

Note that $L(p)$ henceforth denotes the set of all the instances of $p$,
including tree patterns.

Let $L$ be the family of unions of patterns whose minimum depth is greater
than $D$.

$$L = \{\, L(p_1) \cup \cdots \cup L(p_n) \mid \lfloor p_i \rfloor > D \,\}$$

The minimum depth, $\lfloor p \rfloor$, of tree pattern $p$ is defined as follows:

$$\lfloor x \rfloor = 0 \quad (x:\ \text{variable})$$

$$\lfloor c \rfloor = \infty \quad (c:\ \text{constant})$$

$$\lfloor f(p_1, \ldots, p_n) \rfloor = \min(\lfloor p_1 \rfloor, \ldots, \lfloor p_n \rfloor) + 1$$

Let $\alpha, \beta, \ldots$ denote sequences of positive integers. For tree
pattern $p$ and $\alpha$, $p|_\alpha$ is defined as follows:

$$p|_\epsilon = p \quad (\epsilon:\ \text{empty sequence})$$

$$f(p_1, \ldots, p_n)|_{i\alpha} = p_i|_\alpha$$

We say that $\alpha$ is the position of $p|_\alpha$ in $p$. The length of
$\alpha$, denoted by $|\alpha|$, is called the depth of $p|_\alpha$ in $p$.
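As an illustration only, the maximum depth, the minimum depth, and the
subpattern at a position can be written as a short Python sketch. The classes
`Var` and `Fn` and all function names are assumptions made for this example and
do not come from the text; a constant is modelled as a function symbol without
arguments, and positions are tuples of 1-based argument indices.

```python
from dataclasses import dataclass
from math import inf
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Fn:
    symbol: str
    args: Tuple["Term", ...] = ()  # a constant is an Fn with no arguments


Term = Union[Var, Fn]


def max_depth(p: Term) -> int:
    """Maximum depth: 0 for a variable, 1 for a constant, else max over arguments + 1."""
    if isinstance(p, Var):
        return 0
    if not p.args:
        return 1
    return max(max_depth(a) for a in p.args) + 1


def min_depth(p: Term) -> float:
    """Minimum depth: 0 for a variable, infinity for a constant, else min over arguments + 1."""
    if isinstance(p, Var):
        return 0
    if not p.args:
        return inf
    return min(min_depth(a) for a in p.args) + 1


def subterm(p: Term, pos: Tuple[int, ...]) -> Term:
    """Subpattern of p at position pos; the empty tuple returns p itself."""
    for i in pos:
        p = p.args[i - 1]  # positions use 1-based argument indices
    return p


# Example pattern f(g(X), a), with variable X and constant a.
example = Fn("f", (Fn("g", (Var("X"),)), Fn("a")))
print(max_depth(example), min_depth(example), subterm(example, (1, 1)))
```

In this encoding the depth of a subpattern in $p$ is simply `len(pos)`.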

Consider the following conditions:

- $q_1, \ldots, q_N$ are tree patterns.

- $\lceil q_i \rceil \le D$ for $1 \le i \le N$.

- $p_1, p_2 \in L(q_1) \cup \cdots \cup L(q_N)$.

- $\lfloor p_1 \rfloor, \lfloor p_2 \rfloor > D$.

- $p = p[\bar{x}] = \mathrm{lgg}(p_1, p_2)$, where $\mathrm{lgg}(p_1, p_2)$ denotes the
  least general generalization of $p_1$ and $p_2$, and
  $\bar{x} = x_1, \ldots, x_n$ is a sequence of variables that are introduced
  in $\mathrm{lgg}(p_1, p_2)$.

- $p_1 = p[\bar{t}]$, where $\bar{t} = t_1, \ldots, t_n$ is a sequence of tree
  patterns ($p[\bar{t}]$ denotes the result of substituting $\bar{t}$ for
  $\bar{x}$ in $p[\bar{x}]$).

- $\lfloor p \rfloor > D$.

- There do not exist two subpatterns $r[\bar{x}]$ and $s[\bar{x}]$ of $p$ that
  satisfy the following conditions.

  • $p|_\alpha = r[\bar{x}]$ and $p|_\beta = s[\bar{x}]$.

  • $|\alpha| \le D$ and $|\beta| \le D$.

  • $r[\bar{x}] \ne s[\bar{x}]$.

  • $r[\bar{t}] = s[\bar{t}]$.

Theorem 1. If the above conditions are all satisfied, then $p \in L(q_i)$ for
some $q_i$.

(proof) Assume $p_1 \in L(q_i)$, $q_i = q_i[\bar{y}]$ and $p_1 = q_i[\bar{u}]$,
where $\bar{y} = y_1, \ldots, y_m$ is a sequence of variables, and
$\bar{u} = u_1, \ldots, u_m$ is a sequence of tree patterns.

Since $\lfloor p \rfloor > D$, those subpatterns of $p_1$ that are replaced with
variables in $p$ are all at depth greater than $D$. Therefore, since
$\lceil q_i \rceil \le D$, those patterns are all subpatterns of some $u_j$.

If $y_j$ occurs only once in $q_i$, we can define $v_j[\bar{x}] = p|_\alpha$,
where $y_j = q_i|_\alpha$. We then have $u_j = v_j[\bar{t}]$. Assume that $y_j$
occurs more than once in $q_i$.

$$q_i = \cdots y_j \cdots y_j \cdots$$

$$p_1 = \cdots u_j \cdots u_j \cdots$$

Let $r[\bar{x}]$ and $s[\bar{x}]$ be the corresponding subpatterns in $p$.

$$p = \cdots r[\bar{x}] \cdots s[\bar{x}] \cdots$$

$$p_1 = \cdots r[\bar{t}] \cdots s[\bar{t}] \cdots$$

By the last condition of the theorem, $r[\bar{x}]$ and $s[\bar{x}]$ must be
equal. We can then define $v_j[\bar{x}] = r[\bar{x}]\ (= s[\bar{x}])$. We also
have $u_j = v_j[\bar{t}]$.

Finally, we have

$$p = q_i[v_1[\bar{x}], \ldots, v_m[\bar{x}]],$$

and $p \in L(q_i)$. (end of proof)

We now assume that the target language belongs to $L_0$. We also assume

$$S' = L(p_1) \cup \cdots \cup L(p_n).$$

If $\lfloor p_i \rfloor \le D$, we instantiate those variables that occur in
$p_i$ at depth less than or equal to $D$ and obtain $p_{i1}, \ldots, p_{im}$
such that $\lfloor p_{ij} \rfloor > D$ and

$$L(p_i) = L(p_{i1}) \cup \cdots \cup L(p_{im}).$$

(We assume $\Sigma$ is finite.) We then replace $p_i$ with
$p_{i1}, \ldots, p_{im}$. Therefore, we can always assume
$\lfloor p_i \rfloor > D$ for all $i$, so $S' \in L$. In this context, we have
the following theorem for safe generalization.

Theorem 2. We pick $p_1$ and $p_2$ from $S'$ and compute
$p = \mathrm{lgg}(p_1, p_2)$. If $\lfloor p \rfloor > D$, then we check the last
condition of Theorem 1. If the condition holds, we replace $p_1$ and $p_2$ in
$S'$ with $p$. The resulting set of tree patterns is a safe generalization of
$S'$.
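The step described in Theorem 2 can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not the authors'
implementation: the tuple encoding of tree patterns, the fresh variable names
`_x0`, `_x1`, ... (assumed not to occur in the inputs) and all function names
are hypothetical. The last condition of Theorem 1 is tested by using the fact
that the instance of a subpattern of $p$ under the substitution that yields
$p_1$ is the subpattern of $p_1$ at the same position.

```python
from math import inf

# Assumed encoding: a variable is a str; f(p1, ..., pn) is the tuple
# ("f", p1, ..., pn); a constant is a 1-tuple such as ("a",).


def min_depth(p):
    """Minimum depth as defined in the text."""
    if isinstance(p, str):
        return 0
    if len(p) == 1:
        return inf
    return min(min_depth(a) for a in p[1:]) + 1


def lgg(p1, p2, table=None):
    """Least general generalization (anti-unification) of two tree patterns."""
    table = {} if table is None else table
    if p1 == p2:
        return p1
    if (not isinstance(p1, str) and not isinstance(p2, str)
            and p1[0] == p2[0] and len(p1) == len(p2)):
        return (p1[0],) + tuple(lgg(a, b, table) for a, b in zip(p1[1:], p2[1:]))
    # Differing subterms: the same pair is always mapped to the same variable.
    return table.setdefault((p1, p2), f"_x{len(table)}")


def positions(p, max_len):
    """Yield (position, subpattern) for every position of length <= max_len."""
    yield (), p
    if max_len > 0 and not isinstance(p, str):
        for i, arg in enumerate(p[1:], start=1):
            for pos, sub in positions(arg, max_len - 1):
                yield (i,) + pos, sub


def subterm(p, pos):
    for i in pos:
        p = p[i]  # index 0 holds the symbol, so argument i sits at index i
    return p


def last_condition_holds(p, p1, depth_bound):
    """No two distinct subpatterns of p at depth <= D become equal in p1."""
    shallow = list(positions(p, depth_bound))
    for alpha, r in shallow:
        for beta, s in shallow:
            if r != s and subterm(p1, alpha) == subterm(p1, beta):
                return False
    return True


def safe_generalize(patterns, p1, p2, depth_bound):
    """Replace p1 and p2 by their lgg when the side conditions of Theorem 2 hold."""
    p = lgg(p1, p2)
    if min_depth(p) > depth_bound and last_condition_holds(p, p1, depth_bound):
        return [q for q in patterns if q not in (p1, p2)] + [p]
    return patterns  # generalization rejected: the set is left unchanged


a, b = ("a",), ("b",)
ok1, ok2 = ("f", ("g", a)), ("f", ("g", b))
print(safe_generalize([ok1, ok2], ok1, ok2, 1))
bad1, bad2 = ("f", ("g", a), ("g", a)), ("f", ("g", a), ("g", b))
print(safe_generalize([bad1, bad2], bad1, bad2, 1))
```

In the second call the generalization is rejected because the two arguments of
`f` differ in the generalized pattern but coincide in the first input pattern.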

The final question to ask is whether the sequence of languages converges. Let
$q_i = q_i[\bar{y}]$ and $q_i|_\alpha = y_j$, where $\alpha$ is the most shallow
position of $y_j$ in $q_i$. Take $\beta$ such that $|\alpha\beta| = D + 1$. If
$p_1$ and $p_2$ satisfy the following conditions, then we can apply safe
generalization to $p_1$ and $p_2$.

- $p_1 = q_i[\bar{u}]$ and $p_2 = q_i[\bar{u}']$.

- $\lfloor p_1 \rfloor > D$ and $\lfloor p_2 \rfloor > D$.

- $u_k = u'_k$ unless $k = j$.

- $u_j|_\gamma$ and $u'_j|_\gamma$ begin with the same symbol unless
  $\gamma = \beta\delta$ for some $\delta$.

- $u_j|_\beta$ and $u'_j|_\beta$ begin with different function (or constant)
  symbols.

- $u_j|_\beta$ does not occur at any position of $p_1$ except at
  $\alpha'\beta$, where $q_i|_{\alpha'} = y_j$.

By these conditions, a variable is introduced at position $\alpha\beta$ of
$p_1$ by safe generalization.

During the iterative computation of $L_i$, such $p_1$ and $p_2$ should appear,
since any term in the target language (or its generalization) is eventually
generated. This means that, for any such $y_j$ and $\beta$, it is always
possible to introduce a variable at $\beta$ in $u_j$.

For the last condition ($u_j|_\beta$ does not occur ...), $\Sigma$ must have a
sufficient number of function or constant symbols. In some cases, we should
further instantiate $p_1$ and $p_2$ to guarantee this condition.

3.4 Widening and Identification in the Limit 

Abstract interpretation is a methodology for analyzing programs using an ab- 
stract domain whose elements correspond to a set of program states. An abstract 
domain forms a lattice, and it is possible to obtain the general representation of 
a loop invariant using simple iterative computation. 

If the abstract domain is finite, such iterative computation always terminates.
More formally, if $P_0 \le P_1 \le P_2 \le \cdots$ is a non-decreasing infinite
sequence of elements of the lattice, then there exists $n$ such that for all
$i \ge n$, $P_{i+1} = P_i$.

However, if the abstract domain is infinite, a non-decreasing sequence does
not always terminate. In abstract interpretation, widening operators are
employed to guarantee termination. A binary operator, $\nabla$, is termed a
widening operator [7] if it satisfies the following conditions:

- $P \le P \nabla Q$ and $Q \le P \nabla Q$.

- For any non-decreasing infinite sequence $P_0 \le P_1 \le P_2 \le \cdots$,
  the sequence $Q_0 \le Q_1 \le Q_2 \le \cdots$ defined below finitely
  converges.

  • $Q_0 = P_0$.

  • $Q_{i+1} = Q_i \nabla P_{i+1}$.

The last condition means that there exists $n$ such that for all $i \ge n$,
$Q_{i+1} = Q_i$. This condition corresponds to that of identification in the
limit as described by Gold [11].

However, widening operators do not preserve the limit in general. The limit
of $Q_0 \le Q_1 \le Q_2 \le \cdots$ may be greater than that of
$P_0 \le P_1 \le P_2 \le \cdots$.

An example of a widening operator follows. Let the abstract domain be a set
of closed intervals $[l, u]$ on the real axis. It forms a lattice with respect
to set inclusion. A widening operator for this domain is

$$[l, u] \nabla [l', u'] = [\,\text{if } l' < l \text{ then } -\infty \text{ else } l,\ \ \text{if } u' > u \text{ then } +\infty \text{ else } u\,]$$
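The interval widening operator and the derived sequence $Q_0 = P_0$,
$Q_{i+1} = Q_i \nabla P_{i+1}$ can be tried out with the following Python
sketch. Representing an interval as a pair `(l, u)` and the function names are
assumptions made for this illustration.

```python
from math import inf


def widen(p, q):
    """Interval widening: an unstable bound jumps to -infinity or +infinity."""
    (l, u), (l2, u2) = p, q
    return (-inf if l2 < l else l, inf if u2 > u else u)


def widened_sequence(ps):
    """Return Q_0, Q_1, ... with Q_0 = P_0 and Q_{i+1} = widen(Q_i, P_{i+1})."""
    q = ps[0]
    out = [q]
    for p in ps[1:]:
        q = widen(q, p)
        out.append(q)
    return out


# A non-decreasing chain P_i = [0, i] that stops growing at [0, 4].
chain = [(0, i) for i in range(5)]
print(widened_sequence(chain))
```

For this chain the widened sequence stabilizes after one step, but at an
interval that is unbounded above, whereas the chain itself stops at $[0, 4]$.
This illustrates how widening may overshoot the exact limit.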

Another example of a widening operator is that of polyhedra analysis [8, 15].
The abstract domain is the set of polyhedra in $\mathbb{R}^n$. It forms a
lattice with respect to set inclusion. (The previous example is a special case
where $n = 1$.) Let $P$ and $Q$ be polyhedra. They can be expressed as sets of
linear inequalities. Then $P \nabla Q$ can be defined as the set of linear
inequalities of $P$ that are satisfied by $Q$. The following is a refinement of
this operator [14, 15]:

All the inequalities of P satisfied by Q are kept in the result, together
with all the inequalities of Q that are mutually redundant with an
inequality of P, i.e., saturated by the same vertices and rays of P.

Widening operators are employed to guarantee termination or accelerate
convergence. However, they may cause information to be lost in the sense that
convergence does not reach the exact limit. Therefore, it is important to
combine safe generalization and widening. If $Q_i \nabla P_{i+1}$ is a safe
generalization of $Q_i \cup P_{i+1}$, then the sequence
$Q_0 \le Q_1 \le Q_2 \le \cdots$ reaches the exact limit.

4 Concluding Remarks 

In the first endeavor discussed in this paper, discovery by deduction, the most
crucial issue is how to decrease the size of the search space used to find
programs. For example, Perrig and Song pruned protocols before applying the
protocol verifier. More research should be performed to obtain general
principles for efficient search for programs.

The notion of safe generalization was introduced in the second endeavor.
Since many techniques for learning from positive data have been proposed,
applying these techniques to reachability computation should be fruitful. In
particular, it should provide many leads for designing widening operators.

Abstract. One of the important technologies in knowledge discovery is
to access the desired information from the large amount of data stored on 
the WWW. At present, such information can be accessed by a browser 
itself or by using a keyword search function. However, browsing is a time 
consuming task where a user must access individual pages one by one. 
Furthermore, it is generally hard for users to provide reasonable keywords to
discover their desired pages. This paper outlines an approach of
integrating information visualization and retrieval to improve the
effectiveness of WWW information access. In this approach, the link structure of
WWW is displayed in a 3-D hyperbolic tree in which the height of a node 
within the tree indicates a user’s “interestingness” . Here, interestingness 
is calculated by a fitting function between a page and user-supplied key- 
words, and this measure can be used to filter irrelevant pages, reducing 
the size of the link structure. Such functions are incorporated within 
our browser, allowing us to discover desired pages from a large web site 
incrementally. Relatively large web sites were selected to show the per- 
formance of the proposed method with improved accuracy and efficiency 
in WWW information access. 



1 Introduction 

It is natural to regard WWW as a repository for knowledge discovery on the 
Internet. At present, such information can be accessed by a browser itself or by 
using a keyword search function. However, it is difficult to discover desired web
pages using these functions for the following reasons.

— Lack of global view of WWW information 

There are no functions for viewing the overall structure of WWW informa- 
tion. Thus, a user must access individual pages one by one. 

— Lack of query navigation function 

Although keyword search or database-like queries realize efficient information
access, it is generally hard for users to provide reasonable queries to
discover their desired pages.

Some solutions to the first problem have recently been provided. The
hyperbolic tree, developed by Lamping [Lamping 95], is known among the
human-interface community as a reasonable tool for providing a global view of
WWW information. It presents a GUI for hierarchical information structures
including WWW and allows displaying the whole information structure within a
hyperbolic plane. It also supports focus of attention by mouse operation; thus
a user can see the structure at any angle.

As for the second problem, an automatic or semi-automatic structuring method 
for WWW information has been reported in the database community. Ashish, 
et al. [Ashish 97] and Atzeni et al. [Atzeni 97] developed “wrappers” to struc- 
ture multiple web sites and to construct a WWW information database. Such 
approaches are useful for effective information retrieval by integrating various 
Internet sources and realizing SQL-like queries. 

Although the above successful results were obtained by distinct communities,
we hope these can be integrated to give birth to supporting tools for knowledge
discovery from WWW information. A hyperbolic tree approach is a comprehensive
visual tool, but it is difficult to access a large amount of information
using mouse operation only. In contrast, a database approach realizes efficient
information access, but still ignores interactive retrieval processes in
knowledge discovery. This paper conjectures that a combined use of information
visualization and retrieval functions can guide users to their desired pages
with improved accuracy and efficiency in WWW information access.

The rest of the paper describes how visualization and retrieval functions are
reconstructed based on the previous results and are integrated into a WWW
information access system (WIAS). The information retrieval part of WIAS
calculates the fitness of a web page to a user-supplied query. The visual
part of WIAS draws a hyperbolic tree in which each node corresponds to a web
page and the height of a node is the fitness of the page to the query. A web
page with low fitness is filtered; thus only the structure of interesting pages
is displayed as a hyperbolic tree. These functions realize interactive
information retrieval, allowing us to access WWW information effectively by
cooperating with existing browsers.

2 Interactive Information Retrieval 

2.1 Information structure and operations

The target information structure is defined as a directed graph consisting of a
set of texts (usually web pages) and their links (specified by anchor tags in
HTML). Each text is composed of combinations of tags and strings. A tag is
either a prespecified one, as in HTML, or a user-defined one, as in XML. In
general, a tag is specified as <tag>str</tag> where str is a string. For text
$t$, we assume that $tag.str \in t$ holds if $t$ contains <tag>str</tag>.

A string may contain tags that are nested. Let $n$ be the depth of a tag
structure. This case is denoted as $tag_1.\cdots.tag_n.str \in t$. For example,
we have a text

<chapter><section>s1</section></chapter>.

Text $t$ has the following information:

$chapter.\text{"s1"} \in t$
$chapter.section.\text{"s1"} \in t$

We bound the length of hypertext link paths and exclude loops of links
between hypertexts. In this setting, the above-mentioned information structure
forms a tree whose nodes are hypertexts.

We have some operations on a tree. Let $D$ be a display in which every node
of a tree ($tree$) appears. Node $t$, which appears at the center of $D$, is
called the focus, and we denote this fact as $focus(tree, t) \in D$. A user can
change focus from $t$ to a distinct node $t'$. After changing, the updated
display $D'$ has the information $focus(tree, t') \in D'$ and
$focus(tree, t) \notin D'$.

An action to move node $t$ to position $x$ can be denoted by $T_{t \to x}$. For
two displays, $D$ and $D'$, this action is specified by the mapping
$T_{t \to x}: D \to D'$. After taking an action $T_{t \to o}$ to move a node to
the center $o$, $focus(tree, t) \in D'$ holds in the updated display $D'$.
Suppose that node $t$ has information denoted by $tag.s \in t$ and $s$ appears
in display $D$. Such a tag in $t$ is called an attribute and includes the title
and the updated time of $t$. Node $t$ also has some attributes such as the
fitness of $t$ to a user-entered query, access count and history. Such
attributes are presented in the display.



2.2 Features of interactive IR 

Interactive Information Retrieval (IIR) integrates structure visualization and
retrieval of WWW information, providing interactive and incremental access to
Internet sources. This framework has the following features:

1. Visualizing the whole structure of hypertexts 

In order to look over a hypertext structure, the structure is displayed as a 
hyperbolic tree where nodes of upper-level texts are put around the center 
and their size is relatively large. In contrast, nodes of lower-level texts are 
distributed at the corners of the display. A user can see a focal node and 
nodes around it easily, and can also take in the larger hierarchical structure 
at a glance. 

2. Smooth change of focus 

Focus change is used to access nodes of lower-level texts. From the viewpoint
of human cognition, this change is done smoothly, so that there is little gap
between text transitions. When the focal node is specified by a mouse click,
several displays ($D_i$) are created and appear as a sequence
$D \to D_1 \to D_2 \to \cdots \to D'$ to show the continuous process of focus
change. Mouse dragging allows us to arbitrarily change viewpoints, and this
provides a direct manipulation interface to access nodes easily.

3. Attribute-value query on the semi-structured texts

This extracts information like attributes and keywords from a flat text and
realizes an attribute-value query. By parsing the text, such information can
be expressed as follows:

$$tag_1.tag_2.\cdots.tag_n.s \in t$$

where $t$ is a text, $s$ is a string and $tag_i$ is a nested tag. The
expression is then matched with a user-entered query
$\langle tag'_1, \ldots, tag'_n, s' \rangle$. The matching is approximated
based on a similarity operator $\sim$, and is replaced by the following
condition:

$$tag_1 \sim tag'_1 \wedge \cdots \wedge tag_n \sim tag'_n \wedge s \sim s'$$

This means that we introduce a similarity measure between two strings or
tags and process a query by computing the fitness of texts to be retrieved
based on this measure. Retrieved texts are ranked in this way.

4. Visualizing attribute values for each node 

Attribute values for each node have useful information to navigate users 
to their desired texts. These values are also displayed within a hyperbolic 
tree. Attributes such as the title of the text and the fitness to a query tell 
users which nodes should be accessed. By changing the focus based on these 
attribute values, users may easily access their desired texts. 

5. Filtering texts based on attribute values 

While every node is displayed in the previous feature, this feature eliminates
meaningless nodes by using a threshold to determine nodes to be displayed.
Consider node $t$ such that no lower-level node $t'$ of $t$ has a reasonable
attribute value (as determined by the threshold). In this case, we remove $t$
from the hyperbolic tree, and reduce the tree size. This allows users to focus
on interesting nodes.

The above features are interconnected within our IIR framework. In particular,
features (3), (4) and (5) are repeatedly executed. Through such a process, a
user progressively explores the desired texts. This is a distinctive feature of
IIR.

IIR accelerates this interactive aspect by cooperating with a browser because
feature (2) above is inter-related with the browser. After a node is clicked
with the mouse, focus is changed, and the corresponding text is displayed in
the browser. In the same way, when a user browses a text, this text becomes the
focus within a hyperbolic tree display. This type of interaction helps users to
understand which portion of the hypertext structure is in focus.



3 WWW Information Access System 

This section describes how the IIR framework is realized within WIAS. WIAS
takes the source sent from a WWW server and the access log of the server as
input information. This information is processed by tag and string parsers, and
is used in modules such as structure visualization and query manipulation. We
also developed a browser that interacts with the IIR modules. All modules
except for the string parser are implemented in Java, enhancing the portability
of WIAS. The visual part of WIAS extends a 2-D hyperbolic tree to a 3-D type
because node information is represented as the height of the node. A typical
hyperbolic tree algorithm lays out nodes as a circle, but our algorithm lays
them out as an ellipse. Thus, front nodes may not hide rear nodes.



[Figure: WIAS window with a display for the structure of hypertexts, node
information (page title, URL, number of accesses), and a command information
area]

Fig. 1. Output of WIAS



The output of WIAS is shown in Figure 1. A window is decomposed into a 
display for the structure of hypertexts, and areas for entering queries and com- 
mands. Each text is displayed as a node whose height indicates node information 
such as user-access count and fitness to a query. The height is a good indicator 
for efficiently accessing interesting texts. A method of calculating the fitness is 
shown in the next section. 

A hyperbolic tree can be changed arbitrarily by mouse operation. Focus is
changed by clicking the mouse on a node. Mouse dragging can be accepted at
any position, making it easy for users to change the viewpoint of the tree.

Focus is changed by a distance-preserving transformation on a hyperbolic
plane. The transformation $T_{t \to x}$ mentioned in Section 2 is
distance-preserving. Given a mouse click, the position of every node on the
hyperbolic plane is calculated using $T_{t \to o}$. As for mouse dragging, if a
node $t$ is at the mouse-pressed position, the node moves to the center of the
hyperbolic plane and then moves to the mouse-released position. This
transformation is a composite of $T_{t \to o}$ and $T_{o \to x}$.
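One common way to realize such distance-preserving moves is a Möbius
transformation of the unit disk. The following Python sketch assumes the
Poincaré disk model with node positions stored as complex numbers of modulus
less than 1; the text does not state which model WIAS uses, and all function
names are hypothetical.

```python
import math


def to_center(t: complex):
    """Map of the unit disk that sends t to the center o (the origin)."""
    return lambda z: (z - t) / (1 - t.conjugate() * z)


def from_center(x: complex):
    """Inverse of to_center(x): sends the center o to x."""
    return lambda z: (z + x) / (1 + x.conjugate() * z)


def drag(t: complex, x: complex):
    """Composite used for dragging: first move t to the center, then the center to x."""
    first, second = to_center(t), from_center(x)
    return lambda z: second(first(z))


def hyperbolic_distance(z: complex, w: complex) -> float:
    """Distance between two points in the Poincare disk model (curvature -1)."""
    return 2 * math.atanh(abs(z - w) / abs(1 - z.conjugate() * w))


t, x = 0.3 + 0.4j, -0.2 + 0.1j   # pressed node position and release position
z, w = 0.1 - 0.5j, 0.6 + 0.2j    # two other nodes
move = drag(t, x)
print(abs(move(t) - x))          # the dragged node lands on x up to rounding error
print(hyperbolic_distance(z, w), hyperbolic_distance(move(z), move(w)))
```

Applying `move` to every node position redraws the tree; the two printed
distances agree up to rounding error, which is the distance-preserving property.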

3.1 Computing the fitness of a text 

A query for text retrieval is represented by the conjunction:

$$B_1 \wedge \cdots \wedge B_k$$

where $B_i$ ($i = 1, \ldots, k$) is a literal. A literal is of a specialized
form:

$$tag_1.tag_2.\cdots.tag_n.str$$

where $tag_i$ and $str$ are strings. For each text $t$, a conjunction
$B_1 \wedge \cdots \wedge B_k$ is true if $B_i \in t$ is true for every $i$.
The result of the query is the set of texts for which the conjunction is true.

We then consider an approximated query based on partial matching be- 
tween strings usually used in the information retrieval community. The proposed 
method is based on TFIDF in which a similarity between documents (or strings) 
is defined [Salton 91]. In the following, we briefly introduce TFIDF. 

Let $T$ be a vocabulary list of atomic terms that appear in a set of strings
$s_1, \ldots, s_m$ being retrieved as texts^1. A string $s_p$
($1 \le p \le m$) is associated with a vector $v_p \in \mathbb{R}^{|T|}$ whose
element takes a real value^2. For a term $a \in T$, an element of a vector
$v_p$ is represented by $v_{p,a}$. The value of $v_{p,a}$ indicates the
importance of term $a$ with respect to the string associated with $v_p$. If $a$
does not occur in $s_p$, $v_{p,a}$ is 0. Otherwise, the importance is computed
as follows:

$$v_{p,a} = \log(TF_{v_p,a} + 1) \cdot \log\left(\frac{m}{C_a}\right)$$

where $TF_{v_p,a}$ is the number of times that term $a$ occurs in the string
$s_p$, and $C_a$ is the total number of strings that contain the term $a$.

The similarity between strings $s_p$ and $s_q$ can be computed using this
vector representation of strings. Let $v_p$ and $v_q$ be vector representations
of the strings $s_p$ and $s_q$, respectively. The similarity is defined as
follows:

$$sim(s_p, s_q) = \frac{\sum_{a \in T} v_{p,a} \cdot v_{q,a}}{\lVert v_p \rVert \cdot \lVert v_q \rVert}$$

1 WIAS uses the Japanese morphological analysis system, Chasen [Matsumoto 99],
to obtain atomic terms that are nouns in a dictionary.

2 Although $v_p$ is very long in a matrix notation, it is quite sparse. Thus,
efficient implementation is possible.

We use the above TFIDF framework to compute the fitness of a text to a
user-entered query. Consider an atomic query $B = tag_1.\cdots.tag_n.str$. If
there is an attribute-value pair $C_i = tag'_1.\cdots.tag'_n.str' \in t$ in
text $t$, the similarity between the pair and the query is computed as follows:

$$sim(C_i, B) = \prod_{j=1}^{n} sim(tag_j, tag'_j) \times sim(str, str')$$

We then define the fitness $FIT(t, B)$ of the text $t$ to the query $B$ as
follows:

$$FIT(t, B) = \max_{C_i \in t} sim(C_i, B)$$

This means that the most similar attribute-value pair is selected to compute
the fitness.

Finally, the fitness of text $t$ to the conjunctive query
$B_1 \wedge \cdots \wedge B_k$ is defined as follows:

$$FIT(t, B_1, \ldots, B_k) = \prod_{i=1}^{k} FIT(t, B_i)$$

Since the similarity measure in the TFIDF framework is between zero and one,
the fitness is also between zero and one. In our IIR framework, fitness is
displayed as the height of a node within a hyperbolic tree.
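The fitness computation can be sketched in Python as follows. This is an
illustration under several assumptions, not the WIAS implementation (which is
written in Java): whitespace tokenization stands in for morphological analysis,
string similarity is taken to be the cosine of the TFIDF vectors, tags are
compared by exact match (similarity 1 or 0), and the class and function names
are hypothetical.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: whitespace tokenization replaces morphological analysis.
    return s.lower().split()


class Tfidf:
    def __init__(self, strings):
        self.m = len(strings)  # number of strings
        self.c = Counter(a for s in strings for a in set(tokenize(s)))  # C_a

    def vector(self, s):
        """Sparse vector with weights log(TF + 1) * log(m / C_a)."""
        tf = Counter(tokenize(s))
        return {a: math.log(n + 1) * math.log(self.m / self.c[a])
                for a, n in tf.items() if self.c[a] > 0}

    def sim(self, s1, s2):
        """Cosine similarity of the TFIDF vectors (0.0 if either vector is zero)."""
        v, w = self.vector(s1), self.vector(s2)
        dot = sum(x * w.get(a, 0.0) for a, x in v.items())
        norm = (math.sqrt(sum(x * x for x in v.values()))
                * math.sqrt(sum(x * x for x in w.values())))
        return dot / norm if norm > 0 else 0.0


def sim_pair(model, pair, literal):
    """Similarity of an attribute-value pair to a literal: tag product times string similarity."""
    (tags, s), (qtags, qs) = pair, literal
    if len(tags) != len(qtags):
        return 0.0
    tag_sim = math.prod(1.0 if a == b else 0.0 for a, b in zip(tags, qtags))
    return tag_sim * model.sim(s, qs)


def fit(model, text, literals):
    """Fitness of a text: product over literals of the best-matching pair."""
    return math.prod(max((sim_pair(model, c, b) for c in text), default=0.0)
                     for b in literals)


# Each text is a list of (tag path, string) pairs.
pages = {
    "p1": [(("title",), "hyperbolic tree visualization")],
    "p2": [(("title",), "information retrieval query")],
    "p3": [(("title",), "tree pattern learning")],
}
model = Tfidf([s for pairs in pages.values() for _, s in pairs])
query = [(("title",), "tree visualization")]
for name, pairs in pages.items():
    print(name, round(fit(model, pairs, query), 3))
```

With these sample pages, the page whose title shares both query terms receives
the highest fitness, and the page that shares none receives zero.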



3.2 Reducing the tree 

Filtering uninteresting nodes is the most important feature of WIAS. Given a 
query, the fitness to the query for each node is computed and is displayed as the 
height of the node. The filtering function then removes nodes that have lower 
fitness, and restructures a reduced hyperbolic tree. This is very useful for large 
web sites because users can focus on interesting texts only. 
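The reduction can be sketched as a recursive pruning of the tree. In this
Python illustration a node is kept if its own fitness reaches the threshold or
if at least one node below it is kept, so that paths to interesting nodes
survive. The `Node` class, the threshold value and the exact keep rule are
assumptions made for this sketch, not details taken from WIAS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    url: str
    fitness: float
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, threshold: float) -> Optional[Node]:
    """Return a reduced copy of the subtree, or None if nothing in it is interesting."""
    kept = [c for c in (prune(ch, threshold) for ch in node.children) if c is not None]
    if node.fitness >= threshold or kept:
        return Node(node.url, node.fitness, kept)
    return None


def size(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + sum(size(c) for c in node.children)


site = Node("/", 0.0, [
    Node("/a", 0.1, [Node("/a/b", 0.9)]),   # uninteresting page leading to an interesting one
    Node("/c", 0.2, [Node("/c/d", 0.1)]),   # uninteresting branch
])
reduced = prune(site, 0.5)
print(size(site), size(reduced))
```

The uninteresting page `/a` remains in the reduced tree because it leads to
`/a/b`, while the branch under `/c` is removed entirely.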

Figure 2 shows a filtering process in WIAS. The left figure is a hyperbolic
tree of our university research division web site (Science University of Tokyo,
Information Media Center, http://www.imc.sut.ac.jp/) consisting of 226 texts.
Since the height of each node indicates the fitness of the associated text to a
user-supplied query, higher nodes are interesting for the user. There are nine
such nodes among the large number of nodes in the figure.

The right figure shows a reduced hyperbolic tree. Uninteresting nodes from
which interesting nodes can be reached still exist in the hyperbolic tree. The
reduced tree forms a web site tailored to the user and allows the user to see
a web site of manageable size. Although existing search engines list pages that
are interesting for users, it is impossible to see the relationships between
the retrieved pages. In contrast, a hyperbolic tree representation allows users
to capture the web page structure and makes it easy to find interesting portal
sites. Moreover, queries can be posed incrementally, and more interesting pages
can be explored. The combined use of information retrieval and visualization
supports processes in knowledge discovery.

[Figure: hyperbolic tree of the web site; left panel before filtering, right
panel (b) after filtering]

Fig. 2. Filtering process



4 Experiments 

We evaluate WIAS on the following points:

Accuracy User-desired texts should be accessed in an exhaustive manner. It
is preferable that no texts are missed.

Efficiency The cost of reaching desired texts should be low. Uninteresting
texts should not be accessed.

These properties tend to be mutually exclusive, and thus well-balanced informa- 
tion access is needed. Based on the above points, two experiments were conducted 
to show the effectiveness of hyperbolic tree visualization and filtering. 

4.1 Effectiveness of visualization 

The web site we used is our university research division site mentioned in
Section 3.2. As subjects, we selected ten undergraduate students of our
university who had not accessed the site before the experiments. We divided the
subjects into two groups; one used a web browser with hyperbolic tree
visualization and the other used the browser only. We gave each subject 10
problems, each with a hint given as an abstract of a text to be found, and
recorded the time taken to access all the texts together with the operation
history. An example of a problem is "Search for a text where you can see the
picture of professor XXX in the YYY workshop presentation". Since sufficient
keywords were given, the subjects could find the target texts correctly. Note
that the possible operations the subjects used were mouse click for hyperlink
selection, page back and forward, home position and bookmark registration.

Figure 3 shows the number of texts the subjects found. The graph indicates
that use of hyperbolic tree visualization does not contribute to efficient
information access at an early stage, but achieves efficiency after 15 minutes
have passed. This means that the global view function of hyperbolic tree
visualization can assist users in finding their desired texts efficiently when
a target web site becomes larger.

Fig. 3. Browsing performance with and without use of hyperbolic tree visualization

Fig. 4. Title reference in browsing



Checking the subjects' histories of mouse operations is needed to clarify what
is going on while they browse texts. WIAS allows a user to specify a node by
moving the mouse over it and to see the associated text title. Figure 4 shows
how many times subjects viewed text titles while browsing about 50 texts. In
total, over 100 titles were viewed to see all the texts. At the beginning, the
number of title views increased dramatically, but it changed only slightly
after ten texts were found. This indicates that users saw the overall structure
of the web site at the beginning and then found texts to be accessed. In
conclusion, the structure visualization based on the hyperbolic tree provides
an efficient and exhaustive browsing function.

4.2 Effectiveness of filtering

The second experiment shows the advantage of hypertext filtering by combining
visualization and query-based retrieval. The web site we selected is our
university site (http://www.sut.ac.jp) consisting of 3000 texts. This site
supports the search engine InfoSeek, which is used to compare the IIR framework
with mixed use of browsing and a search engine. Since there is a large number
of texts in the site, it is difficult to display the whole structure of the
texts in a hyperbolic tree style. This means that a user must pose queries and
filter a number of texts incrementally. We again had ten subjects, divided into
two groups. Group 1 subjects used a browser and InfoSeek, and the others used
the query-based retrieval and filtering functions of WIAS. The problem given to
the subjects was to find five research laboratory documents related to concepts
such as "environment" and "automobile." The subjects posed some keywords and
repeatedly filtered the hypertexts.




[Figure: browsing process of the subjects plotted against the number of mouse
clicks. S: Search, F: Found]

Fig. 5. Browsing process without use of WIAS

Fig. 6. Browsing process with use of WIAS



Figures 5 and 6 show the experiment results. The average times to find texts
that correspond to the portal sites of the five research laboratories are 23
minutes without WIAS and 12 minutes with WIAS. In the figures, character
"S" indicates that a subject posed a query, and character "F" with an arrow
indicates that a subject found a text. The two graphs allow us to see the
advantage of using WIAS. Specifically, the number of queries posed is quite
small for WIAS use, and the number of mouse clicks is half that for normal
browsing and keyword searching. This is because WIAS supports visualization of
texts including keywords in a hierarchical manner, and thus it is easy for the
subjects to browse and find upper-level nodes that are portal site pages of
research laboratories. This result exemplifies the advantage of integrating
visualization and retrieval in the IIR framework, which provides significant
insight as a supportive tool for discovering WWW information.

5 Comparison with other work 

Ashish et al. [Ashish 97] and Atzeni et al. [Atzeni 97] have attempted to structure 
WWW information to support SQL-like queries. Their approach is to construct 
a database from multiple web sites using “wrapper” programs that deal with 
semi-structured information. Our framework does not compete with theirs but 
may exploit it in generating attribute-value pairs from WWW information. The 
difference is that a query in WIAS is relatively simple because WIAS just focuses 
on retrieving desired texts. Thus, we do not necessarily deal with a general re- 
lational model. This increases the efficiency of information retrieval by adopting 
special data structures. 

We suggest query manipulation based on TFIDF, which was developed by the
information retrieval community. Such an approximate approach has not been
taken in the database research literature. Recently, Cohen introduced a “soft” join
operation in a relational-database-like framework [Cohen 99]. It extends the join
using TFIDF in order to provide query manipulation suitable for WWW
information. Data and queries are expressed in first-order logic, and information is
retrieved by variable binding. This means that a soft join has greater expressive
power than our HR framework. However, visualization is not considered, and the
first-order representation is not exploited for WWW structure
visualization. Our method focuses on text retrieval only, and the fitness of a text to a
query can be displayed within the structured visualization.

Visualization methods for WWW information have recently been proposed within
the discovery science community [Hirokawa 98] [Sawai 98] [Shibayama 98].
However, integration with information retrieval has not been realized. In our approach, a
web browser was specially designed and implemented to communicate with the
visualization and retrieval components, showing the performance of this integrated approach.

Munzner designed an information space in which multiple hyperbolic trees 
are configured three dimensionally [Munzner 98]. However, there are a number 
of nodes on the information space map, and it is hard to focus on the appropriate 
portion of a web site. An alternative visualization scheme was developed as a 
“cone” tree by Robertson [Robertson 91]. In a cone tree, nodes are distributed in 
a three-dimensional space, and thus front nodes may hide rear nodes. Moreover, 
focus change is not as easy as in a hyperbolic tree representation. 







6 Conclusions 

In this paper, we proposed an integration of structure visualization and retrieval 
of WWW information. The resulting system, WIAS, consists of hyperbolic tree 
visualization and attribute-value pair query manipulation, and provides a filter- 
ing function to reduce the size of the WWW information structure. Experiments 
were conducted to show the advantage of the interactive information retrieval 
framework in WIAS. The obtained statistics demonstrate the effectiveness of 
WIAS in accessing WWW information. Since WIAS can be interconnected with 
our browser, the proposed framework will help users browse a large number of 
texts on the Internet.


Abstract. The number, the size, and the dynamics of Internet information
sources bear abundant evidence of the need for automation in information
extraction. This calls for representation formalisms that match
the World Wide Web reality and for learning approaches and learnability
results that apply to these formalisms.

The concept of elementary formal systems is appropriately generalized to
allow for the representation of wrapper classes which are relevant to the
description of Internet sources in HTML format. Related learning results
prove that those wrappers are automatically learnable from examples.
This sets the stage for information extraction from the Internet by
exploitation of inductive learning techniques.



1 Motivation 

Today’s online access to millions or even billions of documents in the World Wide 
Web is a great challenge to research areas related to knowledge discovery and 
information extraction (IE). The general task of IE is to locate specific pieces of 
text in a natural language document. 

The authors’ approach draws advantage from the fact that all documents
prepared for the Internet in HTML, in XML, or in any other possibly forthcoming
syntax have to be interpreted by browsers sitting anywhere in the World Wide
Web. For this purpose, the documents need to contain syntactic expressions
that control their interpretation, including their visual appearance and their
interactive behaviour. In HTML, these are the text formatting and annotating
strings (tags), and in LaTeX, for instance, there are numerous commands. The
document’s content is embedded into those syntactic expressions, which are
usually hidden from the user.

The user is typically not interested in the varying syntactic tricks invoked for 
information presentation, but in the information itself. Accordingly, the present 
approach assumes that the user deals exclusively with the desired contents, 
whereas a system for IE should deal with the syntax. 

In a characteristic scenario of system-supported IE, the user takes a source
document and highlights representative pieces of information (s)he is
interested in. It is assumed that the user’s view of a certain document, which might
evolve gradually or even change over time, can be represented as a
certain relation of text strings from the underlying source. Thus, the user’s input
is just a few sample instances from the relation (s)he sees when looking at
the given source document.

It is left to the system to understand how the target information is wrapped 
into syntactic expressions. This is a first learning task posed to the system. 

Next, the system has to generalize the hypothesized wrapper concept in order
to come up with an extraction procedure. Applied to the given source document,
this extraction mechanism generates a certain hypothetical relation which is
returned to the user. This generalization step is a second learning
task to be solved by the system.

In response to providing a few samples illustrating his or her view of the
source document, the user receives a list of extracted tuples. The user may compare
the system’s guess with the results aimed at and, depending on the outcome of the
comparison, complain about the result, if necessary, by indicating erroneously
extracted tuples or by supplying tuples missing from the system’s output.

When the user returns this information to the system, a new cycle of
learning and information extraction is initiated. Several further interactions may
follow, and the extraction mechanism generated may be applied to further source
documents. Several cycles of interaction will improve the IE results.

From a theoretical perspective, the system is performing a two-level learning 
process based on particular positive and negative examples provided by the user. 

2 Introduction 

Let us explain this approach with the help of the bibliography of this paper.
Imagine that the list of referenced papers is presented in a semi-structured form,
namely HTML, as follows:

<html><body><h1>References</h1>

<ol><li>D. Angluin, ’Inductive inference of formal languages from positive data’,
<i>Information and Control</i>, <b>45</b>, 117-135, (1980).</li>

<li>D. Angluin and C.H. Smith, ’A survey of inductive inference: Theory
and methods’, <i>Computing Surveys</i>, <b>15</b>, 237-269, (1983).</li>

<li>S. Arikawa, ...



Figure 1.

In order to extract information, we can assume special text segments to be
delimiters (henceforth called anchors) marking the beginning and the end of the
relevant information to be extracted.

For instance, to extract the authors of a publication, the text
segments <li> resp. , ’ are interpreted as the left and the right anchor for the
authors’ name(s). The year of publication (more exactly, its last two digits) can
then be identified by the left anchor (19 and the right anchor ).</li>. Using these
anchors, the extraction of all authors’ names and the relevant years of publication
from the bibliography represented as an HTML document is straightforward. One
first tries to find the left and right anchor for the authors’ name(s) and then, to
the right, the nearest left and right anchor for the year of publication. This is
repeated as long as those pairs of anchors can be found.
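
The scanning procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation behind this paper: the function name `extract_pairs`, the sample string `DOC`, and the use of plain ASCII quotes and angle brackets are assumptions made for the example.

```python
def extract_pairs(doc, l1, r1_options, l2, r2):
    """Scan doc left to right and collect (author, year) pairs.

    l1 / r1_options delimit the author field, l2 / r2 the year field.
    All names are hypothetical; they mirror the anchors named in the text.
    """
    pairs = []
    pos = 0
    while True:
        # Find the next left anchor of the author field.
        start = doc.find(l1, pos)
        if start < 0:
            break
        a_begin = start + len(l1)
        # Take the nearest admissible right anchor, so the author string
        # is as short as possible.
        hits = [h for h in (doc.find(r, a_begin) for r in r1_options) if h >= 0]
        if not hits:
            break
        a_end = min(hits)
        # To the right, find the nearest left and right anchor of the year.
        y_left = doc.find(l2, a_end)
        if y_left < 0:
            break
        y_begin = y_left + len(l2)
        y_end = doc.find(r2, y_begin)
        if y_end < 0:
            break
        pairs.append((doc[a_begin:a_end], doc[y_begin:y_end]))
        pos = y_end + len(r2)
    return pairs


DOC = (
    "<html><body><h1>References</h1>"
    "<ol><li>D. Angluin, 'Inductive inference of formal languages from positive data', "
    "<i>Information and Control</i>, <b>45</b>, 117-135, (1980).</li>"
    "<li>D. Angluin and C.H. Smith, 'A survey of inductive inference: Theory and methods', "
    "<i>Computing Surveys</i>, <b>15</b>, 237-269, (1983).</li>"
)

PAIRS = extract_pairs(DOC, "<li>", [", '", ",<i>"], "(19", ").</li>")
print(PAIRS)
```

Choosing the nearest right anchor in each step is a simple way to keep the extracted strings free of anchor occurrences; the rule-based wrapper discussed next expresses the same requirement declaratively.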

Several approaches [8, 14, 15] in the IE community use this basic idea to define
extraction procedures (wrappers or templates) based on their own description
language. Further investigations showed that wrappers can be classified according
to their expressiveness based on several structural constraints. This leads
to the general question whether or not such regularly structured descriptions
can automatically be constructed and, furthermore, learned. A rule-based way to
define a wrapper for the previously discussed example is presented in Figure 2.

Here, capital letters denote variables, and terminal symbols are typeset in
italics. The first rule can be interpreted as follows: An author A and the year of
publication Y can be extracted from a document D in case that (i) D matches
the pattern X1 L1 A R1 X2 L2 Y R2 X3 and (ii) the instantiations of the variables meet
certain constraints. For example, the constraint r1(R1) states that the variable
R1 can only be replaced by the string , ’ or the string ,<i>. Further constraints like
nc-r1(A) explicitly state which text segments are not suited to be substituted
for the variable A (cf. rules 2-5 and 15-16). In this particular case, text segments
that contain the substrings , ’ or ,<i> are not allowed. If a document D matches
the pattern X1 L1 A R1 X2 L2 Y R2 X3 and if all specified constraints are fulfilled, then
the instantiations of the variables A and Y yield the information required.

 1  extract(A, Y, X1 L1 A R1 X2 L2 Y R2 X3) ← l1(L1), nc-r1(A), r1(R1),
                                               nc-l2(X2),
                                               l2(L2), nc-r2(Y), r2(R2).
 2  nc-r1(X) ← not c-r1(X).
 3  c-r1(X) ← r1(X).     4  c-r1(XY) ← c-r1(X).     5  c-r1(XY) ← c-r1(Y).
 6  nc-l2(X) ← not c-l2(X).
 7  c-l2(X) ← l2(X).     8  c-l2(XY) ← c-l2(X).     9  c-l2(XY) ← c-l2(Y).
10  nc-r2(X) ← not c-r2(X).
11  c-r2(X) ← r2(X).    12  c-r2(XY) ← c-r2(X).    13  c-r2(XY) ← c-r2(Y).
14  l1(<li>).
15  r1(, ’).
16  r1(,<i>).
17  l2((19).
18  r2().</li>).



Figure 2.

The aim of this paper is twofold. We propose a uniform way to represent
wrappers, namely advanced elementary formal systems (AEFSs, for short). This
concept generalizes Smullyan’s [12] elementary formal systems (EFSs), which
have been studied thoroughly in different respects. We investigate the expressive
power of AEFSs and show that it is sufficient for wrapper representation while
their semantics remains reasonable. To lay a cornerstone of our application-oriented
work, we investigate which learnability results achieved for EFSs lift to AEFSs.
Additionally, we show by way of example how AEFSs can be used to describe a
certain class of HTML wrappers, so-called island wrappers (cf. [14,15]). We prove
the learnability of island wrappers from only positive examples under certain
natural assumptions.

3 Advanced Elementary Formal Systems 

In this section, we introduce a quite general formalism to describe wrappers, 
namely advanced elementary formal systems (AEFSs, for short). In addition, we 
study the expressive power of AEFSs and deal with the question of whether or 
not AEFSs can be learned from examples. 

AEFSs generalize Smullyan’s [12] elementary formal systems (EFSs, for
short), which he introduced to develop his theory of recursive functions. In
recent years, the learnability of EFSs has been studied intensively within several
formal frameworks (cf. [4, 16, 3, 5, 11, 17, 10]).



3.1 Elementary Formal Systems 

Next, we provide notions and notations that allow for a formalization of EFSs. 

Assume three mutually disjoint sets: a finite set $\Sigma$ of characters, a finite
set $\Pi$ of predicates, and an enumerable set $\mathcal{X}$ of variables. We call every element
in $(\Sigma \cup \mathcal{X})^+$ a pattern and every string in $\Sigma^+$ a ground pattern. For a pattern $\pi$,
we let $v(\pi)$ be the set of variables in $\pi$.

Let $p \in \Pi$ be a predicate of arity $n$ and let $\pi_1, \ldots, \pi_n$ be patterns. Let
$A = p(\pi_1, \ldots, \pi_n)$. Then, $A$ is said to be an atomic formula (an atom, for short).
$A$ is ground if all the patterns $\pi_i$ are ground. Moreover, $v(A)$ denotes the set of
variables in $A$.

Let $A$ and $B_1, \ldots, B_n$ be atoms. Then, $r = A \leftarrow B_1, \ldots, B_n$ is a rule, $A$ is the
head of $r$, and all the $B_i$ form the body of $r$. Then, $r$ is a ground rule if all atoms
in $r$ are ground. Additionally, if $n = 0$, then $r$ is called a fact and sometimes we
write $A$ instead of $A \leftarrow$.

Let $\sigma$ be a non-erasing substitution, i.e., a mapping from $\mathcal{X}$ to $\Sigma^+$. For
any pattern $\pi$, $\pi\sigma$ is the pattern which one obtains when applying $\sigma$ to $\pi$. Let
$C = p(\pi_1, \ldots, \pi_n)$ be an atom and let $r = A \leftarrow B_1, \ldots, B_n$ be a rule. Then, we
set $C\sigma = p(\pi_1\sigma, \ldots, \pi_n\sigma)$ and $r\sigma = A\sigma \leftarrow B_1\sigma, \ldots, B_n\sigma$. If $r\sigma$ is ground, then it
is said to be a ground instance of $r$.

Definition 1 ([5]) Let $\Sigma$, $\Pi$, and $\mathcal{X}$ be fixed, and let $\Gamma$ be a finite set of rules.
Then, $S = (\Sigma, \Pi, \Gamma)$ is said to be an EFS.

EFSs can be considered as particular logic programs without negation. There
are two major differences: (i) patterns play the role of terms and (ii) unification
has to be realized modulo the equational theory

$E = \{\circ(x, \circ(y, z)) = \circ(\circ(x, y), z)\}$,
where $\circ$ is interpreted as concatenation of patterns.

As for logic programs (cf., e.g., [9]), the semantics of an EFS $S$ can be defined
via the following operator $T_S$. In the corresponding definition, we use the
following notations. For any EFS $S = (\Sigma, \Pi, \Gamma)$, $B(S)$ denotes the set of all
well-formed ground atoms over $\Sigma$ and $\Pi$, and $G(S)$ denotes the set of all ground
instances of rules in $\Gamma$.

Definition 2 Let $S$ be an EFS and let $I \subseteq B(S)$. Then, we let
$T_S(I) = I \cup \{A \mid A \leftarrow B_1, \ldots, B_n \in G(S) \text{ for some } B_1 \in I, \ldots, B_n \in I\}$.

Note that, by definition, the operator $T_S$ is extensive, i.e., $I \subseteq T_S(I)$, and monotonic.

As usual, we let $T_S^{n+1}(I) = T_S(T_S^n(I))$, where $T_S^0(I) = I$, by convention.

Definition 3 Let $S$ be an EFS. Then, we let $Sem(S) = \bigcup_{n \in \mathbb{N}} T_S^n(\emptyset)$.
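
As a small worked example (chosen here purely for illustration; it is not taken from the original text), let $\Sigma = \{a, b\}$, $\Pi = \{p\}$, and $\Gamma = \{p(ab) \leftarrow,\ p(aXb) \leftarrow p(X)\}$. Iterating the operator from the empty set gives:

```latex
\begin{align*}
T_S^0(\emptyset) &= \emptyset, \\
T_S^1(\emptyset) &= \{p(ab)\}, \\
T_S^2(\emptyset) &= \{p(ab),\ p(aabb)\}, \\
T_S^n(\emptyset) &= \{p(a^i b^i) \mid 1 \le i \le n\}, \\
Sem(S) &= \bigcup_{n \in \mathbb{N}} T_S^n(\emptyset) = \{p(a^n b^n) \mid n \ge 1\}.
\end{align*}
```

Each step adds exactly those heads of ground instances whose body atoms are already present; for instance, the ground instance $p(aabb) \leftarrow p(ab)$ of the second rule contributes $p(aabb)$ in the second step.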

In general, $Sem(S)$ is semi-decidable, but not decidable. However, as we will
see below, $Sem(S)$ turns out to be decidable in case that $S$ meets several natural
syntactical constraints (cf. Theorem 5).

Finally, by $\mathcal{E}$ we denote the collection of all EFSs.

When using EFSs to describe wrappers, a problem that usually occurs is that
one has to put constraints on the patterns that form admissible substitutions for
particular variables (cf., e.g., the wrapper in Figure 2). One approach to cope
with this problem is to explicitly describe the admissible patterns. In some cases,
it is more convenient to explicitly describe the exceptions and to postulate that
every pattern that does not serve as an exception is admissible. In contrast to EFSs,
AEFSs provide enough flexibility to realize the latter approach, as well.



3.2 Beyond Elementary Formal Systems 

Informally speaking, an AEFS is an EFS that may additionally contain rules
of the form $A \leftarrow \mathit{not}\ B_1$, where $A$ and $B_1$ are atoms and not stands for
negation as finite failure. The underlying meaning is as follows. If, for instance,
$A = p(x_1, \ldots, x_n)$ and $B_1 = q(x_1, \ldots, x_n)$, then the predicate $p$ succeeds iff
the predicate $q$ fails.

However, taking into consideration the conceptual difficulties that occur
when defining the semantics of logic programs with negation as finite failure
(cf., e.g., [9]), AEFSs are constrained to meet several additional syntactic
requirements (cf. Definition 4). The requirements posed guarantee that, similarly
to stratified logic programs (cf., e.g., [9]), the semantics of AEFSs can easily
be described. Moreover, as a side-effect, the requirements posed guarantee that
AEFSs inherit some of the convenient properties of EFSs (cf., e.g., Theorem 5).

Before formally defining what AEFSs look like, we need some more notation.
Let $\Gamma$ be a set of rules (including rules of the form $A \leftarrow \mathit{not}\ B_1$). Then, $hp(\Gamma)$
denotes the set of predicates that appear in the head of any rule in $\Gamma$.

Definition 4 AEFSs and their semantics are defined according to the following
rules. Let $S' = (\Sigma, \Pi', \Gamma')$ be an EFS. Moreover, let $S_1 = (\Sigma, \Pi_1, \Gamma_1)$ and
$S_2 = (\Sigma, \Pi_2, \Gamma_2)$ be AEFSs.

1. $S = S'$ is an AEFS. Moreover, we let $Sem(S) = Sem(S')$.

2. If $\Pi_1 \cap \Pi_2 = \emptyset$, then $S = (\Sigma, \Pi_1 \cup \Pi_2, \Gamma_1 \cup \Gamma_2)$ is an AEFS. Moreover, we
let $Sem(S) = Sem(S_1) \cup Sem(S_2)$.

3. Let $p \notin \Pi_1$ and $q \in \Pi_1$ be predicates of arity $n$. Then,
$S = (\Sigma, \Pi_1 \cup \{p\}, \Gamma_1 \cup \{p(x_1, \ldots, x_n) \leftarrow \mathit{not}\ q(x_1, \ldots, x_n)\})$ is an AEFS. Moreover, we let
$Sem(S) = Sem(S_1) \cup \{p(s_1, \ldots, s_n) \mid s_1 \in \Sigma^+, \ldots, s_n \in \Sigma^+,\ q(s_1, \ldots, s_n) \notin Sem(S_1)\}$.

4. Let $hp(\Gamma') \cap \Pi_1 = \emptyset$. Then, $S = (\Sigma, \Pi' \cup \Pi_1, \Gamma' \cup \Gamma_1)$ is an AEFS. Moreover,
we let $Sem(S) = \bigcup_{n \in \mathbb{N}} T_{S'}^n(Sem(S_1))$.

Finally, by $\mathcal{AE}$ we denote the collection of all AEFSs.
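
To illustrate rule 3 of Definition 4 with a small example (chosen here purely for illustration; it does not appear in the original text), let $\Sigma = \{a, b\}$ and $S_1 = (\Sigma, \{q\}, \{q(ab) \leftarrow\})$, so that $Sem(S_1) = \{q(ab)\}$. Adding a fresh unary predicate $p$ defined by negation yields:

```latex
\begin{align*}
S &= (\Sigma,\ \{q, p\},\ \{q(ab) \leftarrow,\ p(x_1) \leftarrow \mathit{not}\ q(x_1)\}), \\
Sem(S) &= \{q(ab)\} \cup \{p(s) \mid s \in \Sigma^+,\ q(s) \notin Sem(S_1)\} \\
       &= \{q(ab)\} \cup \{p(s) \mid s \in \Sigma^+,\ s \neq ab\}.
\end{align*}
```

Hence $p$ holds for every non-empty string except $ab$: the exception is described explicitly and everything else is admissible.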

Having a closer look at Figure 2, one realizes that AEFSs can be used to 
describe interesting wrappers. 



3.3 Expressiveness 

In the following, we show how AEFSs can be used to describe formal languages 
and relate the resulting language classes to the language classes of the classical 
Chomsky hierarchy (cf. [7]). Although we are mainly interested in using AEFSs 
for describing wrappers, we strongly believe that the established relations are 
quite helpful to better understand the expressive power of AEFSs. 

Definition 5 Let $S = (\Sigma, \Pi, \Gamma)$ be an AEFS and let $p \in \Pi$ be a unary predicate.
We let $L(S, p) = \{s \mid p(s) \in Sem(S)\}$.

Intuitively speaking, $L(S, p)$ is the language which the AEFS $S$ defines via
the unary predicate $p$.

Definition 6 Let $\mathcal{M} \subseteq \mathcal{AE}$. Then, the set $\mathcal{L}(\mathcal{M})$ of all languages that are
definable with AEFSs in $\mathcal{M}$ contains every language $L$ for which there are an AEFS
$S = (\Sigma, \Pi, \Gamma)$ in $\mathcal{M}$ and some unary predicate $p \in \Pi$ such that $L = L(S, p)$.

For example, $\mathcal{L}(\mathcal{AE})$ is the class of all languages that are definable by AEFSs.
Our first result puts the expressive power of AEFSs into the right perspective.
Let $\mathcal{L}_{r.e.}$ be the class of all recursively enumerable languages.

Theorem 1 $\mathcal{L}_{r.e.} \subset \mathcal{L}(\mathcal{AE})$.

Moreover, the following closure properties can be shown.

Theorem 2 $\mathcal{L}(\mathcal{AE})$ is closed under the operations union, intersection,
complement, and concatenation.

To elaborate a more accurate picture, similarly to [3], we next introduce 
several constraints on the structure of the rules an EFS resp. AEFS may contain. 

Let $r$ be a rule of the form $A \leftarrow \mathit{not}\ B_1$ or $A \leftarrow B_1, \ldots, B_n$. Then, $r$ is said to
be variable-bounded iff, for all $i \le n$, $v(B_i) \subseteq v(A)$. Moreover, $r$ is said to be
length-bounded iff, for all substitutions $\sigma$, $|A\sigma| \ge |B_1\sigma| + \cdots + |B_n\sigma|$. Clearly, if
$r$ is length-bounded, then $r$ is also variable-bounded. Note that, in general, the
opposite does not hold. Finally, $r$ is said to be simple iff $A$ is of the form $p(\pi)$ and
$\pi$ is a pattern in which every variable occurs at most once.
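
Both properties can be checked syntactically. Writing the length of an instantiated atom as its number of characters plus, for each variable, the number of its occurrences times the length of the substituted string, one sees that a rule is length-bounded exactly when the head is at least as long as the whole body and no variable occurs more often in the body than in the head. The Python sketch below implements this check under assumptions made only for this example: atoms are `(predicate, [patterns])` pairs, patterns are strings, upper-case letters are variables, and the length of an atom is the total length of its argument patterns.

```python
from collections import Counter


def is_variable(symbol):
    # Assumption of this sketch: upper-case letters are variables,
    # everything else is a character of the alphabet.
    return symbol.isupper()


def symbols(atom):
    """Return all symbols in the argument patterns of an atom (pred, [patterns])."""
    _pred, patterns = atom
    return [sym for pattern in patterns for sym in pattern]


def variables(atom):
    return {sym for sym in symbols(atom) if is_variable(sym)}


def is_variable_bounded(head, body):
    """Every variable of every body atom also occurs in the head."""
    return all(variables(b) <= variables(head) for b in body)


def is_length_bounded(head, body):
    """Check |head s| >= total |body s| for all non-erasing substitutions s,
    using the occurrence-counting criterion described above."""
    head_syms = symbols(head)
    body_syms = [sym for b in body for sym in symbols(b)]
    if len(head_syms) < len(body_syms):
        return False
    head_occ = Counter(sym for sym in head_syms if is_variable(sym))
    body_occ = Counter(sym for sym in body_syms if is_variable(sym))
    return all(head_occ[x] >= count for x, count in body_occ.items())


# p(aXb) <- p(X): variable-bounded and length-bounded.
R1 = (("p", ["aXb"]), [("p", ["X"])])
# p(X) <- q(XX): variable-bounded, but not length-bounded.
R2 = (("p", ["X"]), [("q", ["XX"])])
# p(X) <- q(XY): neither, since Y does not occur in the head.
R3 = (("p", ["X"]), [("q", ["XY"])])

for head, body in (R1, R2, R3):
    print(is_variable_bounded(head, body), is_length_bounded(head, body))
```

The second rule is a concrete witness for the remark that variable-boundedness does not imply length-boundedness.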

Definition 7 Let $S = (\Sigma, \Pi, \Gamma)$ be an AEFS. Then, $S$ is said to be
variable-bounded iff all $r \in \Gamma$ are variable-bounded.

Moreover, $S$ is said to be length-bounded iff all $r \in \Gamma$ are length-bounded.

Next, $S$ is said to be regular iff $\Pi$ contains exclusively unary predicates and all
$r \in \Gamma$ are length-bounded as well as simple.

Finally, by vb-$\mathcal{AE}$ (vb-$\mathcal{E}$), lb-$\mathcal{AE}$ (lb-$\mathcal{E}$), and reg-$\mathcal{AE}$ (reg-$\mathcal{E}$) we denote the
collection of all variable-bounded, length-bounded, and regular AEFSs (EFSs).

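Now, similarly to Theorem 2, the following result can be established.

Theorem 3 $\mathcal{L}(\text{vb-}\mathcal{AE})$, $\mathcal{L}(\text{lb-}\mathcal{AE})$, and $\mathcal{L}(\text{reg-}\mathcal{AE})$ are closed under the
operations union, intersection, complement, and concatenation.

The next theorem summarizes the announced relations to some important
language classes of the Chomsky hierarchy (cf. [7]). Here, $\mathcal{L}_{cs}$ and $\mathcal{L}_{cf}$ denote
the class of all context-sensitive and context-free languages, respectively.

Theorem 4

1. $\mathcal{L}_{r.e.} \subset \mathcal{L}(\text{vb-}\mathcal{AE})$.

2. $\mathcal{L}_{cs} = \mathcal{L}(\text{lb-}\mathcal{AE})$.

3. $\mathcal{L}_{cf} \subset \mathcal{L}(\text{reg-}\mathcal{AE}) \subset \mathcal{L}_{cs}$.

Assertion (2) of the latter theorem allows for the following insight:

Theorem 5 If $S \in \text{lb-}\mathcal{AE}$, then $Sem(S)$ is decidable.

Note that, for length-bounded EFSs, the equivalent of Theorem 5 has already
been shown in [5]. Moreover, for EFSs, Theorem 4 rewrites as follows:

Theorem 6 ([5])

1. $\mathcal{L}_{r.e.} = \mathcal{L}(\text{vb-}\mathcal{E})$.

2. $\mathcal{L}_{cs} = \mathcal{L}(\text{lb-}\mathcal{E})$.

3. $\mathcal{L}_{cf} = \mathcal{L}(\text{reg-}\mathcal{E})$.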



3.4 Learnability 

At the end of this section, we present some basic results concerning the question
of whether or not AEFSs can be learned from examples. To be more precise, we
study the learnability of several language classes that are definable by AEFSs
within the learning model introduced by Gold [6]. As we shall see, the results
obtained provide a firm basis to derive answers to the question to what extent,
if at all, HTML wrappers can automatically be synthesized from examples.

Let us briefly review the necessary basic concepts. We refer the reader to [2]
and [18], which contain all missing details concerning Gold’s [6] model.

There are several ways to present information about formal languages to be
learned. The basic approaches are defined via the concepts text and informant,
respectively. A text is just any sequence of words exhausting the target language.
An informant is any sequence of words labelled either by 1 or by 0 such
that all the words labelled by 1 form a text whereas the remaining words labelled
by 0 constitute a text of the complement of the target language.

Now, an algorithmic learner (henceforth called IIM) receives as its inputs
larger and larger initial segments of a text $t$ [an informant $i$] for a target
language $L$ and generates as its outputs hypotheses. In our setting, an IIM is
supposed to generate AEFSs resp. EFSs as hypotheses. An IIM learns a target
language $L$ from text $t$ [informant $i$], if the sequence of its outputs stabilizes on
an AEFS which correctly describes $L$. Now, an IIM is said to learn $L$ from text
[informant], if it learns $L$ from every text [every informant] for it. Furthermore,
some language class $\mathcal{C}$ is said to be learnable from text [informant], if there is an
IIM which learns every language $L \in \mathcal{C}$ from text [informant].

Next, we summarize the established learnability and non-learnability results. 

Theorem 7

1. $\mathcal{L}(\text{vb-}\mathcal{AE})$ is not learnable from informant.

2. $\mathcal{L}(\text{lb-}\mathcal{AE})$ is learnable from informant.

Proof. By Theorem 4, $\mathcal{L}_{r.e.} \subseteq \mathcal{L}(\text{vb-}\mathcal{AE})$. Since there is no learning algorithm
which is capable of learning the class $\mathcal{L}_r$ of all recursive languages from informant
(cf. [6]), we obtain (1) because of $\mathcal{L}_r \subseteq \mathcal{L}_{r.e.}$. Furthermore, since $\mathcal{L}(\text{lb-}\mathcal{AE}) = \mathcal{L}_{cs}$,
we know that $\mathcal{L}(\text{lb-}\mathcal{AE})$ constitutes an effectively enumerable class of recursive
languages. Hence, we get (2), since every effectively enumerable class of recursive
languages is learnable from informant (cf. [6]). □
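
The last step of the proof rests on Gold's identification-by-enumeration argument: the learner always outputs the first hypothesis in a fixed enumeration that is consistent with all labelled examples seen so far. A minimal Python sketch follows. The finite hypothesis list and the toy languages (all non-empty strings over {a} except one forbidden length) are assumptions made for illustration; they stand in for an effective enumeration of a class of recursive languages with decidable membership, and the labels 1 and 0 are represented as `True` and `False`.

```python
def identify_by_enumeration(hypotheses, informant):
    """Yield, after each labelled word, the index of the first consistent hypothesis.

    hypotheses: list of membership tests (word -> bool); a finite stand-in
                for an effective enumeration of decidable languages.
    informant:  iterable of (word, label) pairs; label is True iff the word
                belongs to the target language.
    The target is assumed to occur in the list (otherwise an IndexError results).
    """
    seen = []
    index = 0
    for word, label in informant:
        seen.append((word, label))
        # Advance to the first hypothesis that agrees with every example so far.
        while any(hypotheses[index](w) != lab for w, lab in seen):
            index += 1
        yield index


def make_language(j):
    """Membership test for {a}^+ without the word a^j (for j = 0: all of {a}^+)."""
    return lambda w: len(w) > 0 and set(w) == {"a"} and len(w) != j


HYPOTHESES = [make_language(j) for j in range(6)]
INFORMANT = [("a", True), ("aa", True), ("aaa", False), ("aaaa", True)]
GUESSES = list(identify_by_enumeration(HYPOTHESES, INFORMANT))
print(GUESSES)
```

Once the negative example `aaa` arrives, the learner abandons its initial guess and settles on the hypothesis that excludes exactly this word; negative information is what makes this strategy converge.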

Based on weaker information, if exclusively positive examples are available,
only relatively small language classes turn out to be learnable at all. Next, for
all $k \ge 1$, we let lb-$\mathcal{AE}_k$ (lb-$\mathcal{E}_k$) denote the collection of all AEFSs (EFSs) which
consist of at most $k$ rules.

Theorem 8

1. For all $k \ge 2$, $\mathcal{L}(\text{lb-}\mathcal{AE}_k)$ is not learnable from text.

2. $\mathcal{L}(\text{lb-}\mathcal{AE}_1)$ is learnable from text.

Proof. Since, by definition, $\mathcal{L}(\text{lb-}\mathcal{AE}_1) = \mathcal{L}(\text{lb-}\mathcal{E}_1)$, (2) is a special case of
Assertion (2) in Theorem 9. It remains to verify (1).

Let $\Sigma = \{a\}$, let $L_0 = \{a\}^+$ and, for all $j \ge 1$, let $L_j = L_0 \setminus \{a^j\}$. It is
folklore that there is no learning algorithm that is able to learn all languages
in $\mathcal{C} = \{L_j \mid j \in \mathbb{N}\}$ from positive data (cf., e.g., [18], for the relevant details).
It suffices to show that $\mathcal{C} \subseteq \mathcal{L}(\text{lb-}\mathcal{AE}_2)$. This can be seen as follows. In case of
$j = 0$, let $S_0 = (\Sigma, \{p\}, \{p(x) \leftarrow\})$. Clearly, $L(S_0, p) = L_0$. Next, let $j \ge 1$.
Then, $S_j = (\Sigma, \{p, q\}, \{q(a^j) \leftarrow,\ p(x) \leftarrow \mathit{not}\ q(x)\})$. Clearly, $L(S_j, p) = L_j$. □

For EFSs, the equivalent of Theorem 7 holds, as well. In contrast, Theorem 8
rewrites as follows.

Theorem 9 ([11])

1. $\mathcal{L}(\text{lb-}\mathcal{E})$ is not learnable from text.

2. For all $k \ge 1$, $\mathcal{L}(\text{lb-}\mathcal{E}_k)$ is learnable from text.



4 Wrappers for HTML Documents 

Next, we demonstrate that AEFSs provide an appropriate framework to de- 
scribe wrappers of practical relevance. Henceforth, the corresponding wrappers 
are called island wrappers. As a main result, we show under which assumptions 
learning techniques can be invoked to automatically generate island wrappers 
from positive examples. 

Semi-structured documents carry different information and, moreover, the 
relevance of the information naturally depends on the users’ perspective. As 
in most IE approaches, we assume that the content of a semi-structured docu- 
ment D (more formally, its semantics) is a set of tuples which D contains. For 
example, the tuples (D. Angluin, 80), (D. Angluin and C.H. Smith, 83), ...
form relevant information in the list of references of the present paper (cf. Fig- 
ure 1). Now, the aim of the IE task is to provide a wrapper which allows one 
to extract all tuples of this kind from this document as well as from all similar 
documents. In contrast to rather traditional approaches to IE in which it is the 
users’ task to construct the relevant wrappers, we are interested in algorithms 
that automatically synthesize appropriate wrappers from examples. 



4.1 Semantics of HTML Documents 

Let $D \in \Sigma^+$ be a document. The information which $D$ contains is a finite
set of tuples of strings $(s_1, \ldots, s_n)$ where, as a rule, each of these strings must
occur in $D$. Together with a tuple $(s_1, \ldots, s_n)$, it is important to know to which
subword in $D$ a string $s_i$ refers. For instance, consider the tuple $(s_1, s_2)$ =
(B. Thomas, 99) in the list of references. Then, it might be intended that $(s_1, s_2)$
has its origin either completely in reference [14] or completely in [15]. It is rather
unlikely that $s_1$ belongs to [14] and that $s_2$ belongs to [15].

More formally speaking, the semantics of a document $D$ is more than a set
of tuples that describe the information contained. The semantics of $D$ depends
on an interpretation $I$ which relates the strings in the tuples to subwords in $D$.
Formally, a function $S : \Sigma^+ \to \wp((\Sigma^+)^n)$ is a semantics iff there
is an interpretation $I$ such that for all $D \in \Sigma^+$ and for all $(s_1, \ldots, s_n) \in S(D)$
the Conditions (1) and (2) are fulfilled, where

(1) $I((s_1, \ldots, s_n), D)$ describes at which positions the $s_i$ begin in $D$.

(2) $s_{i+1}$ begins in $D$ after $s_i$ ends in $D$.

Intuitively speaking, if examples are provided that illustrate a certain
semantics $S$, a learning algorithm is supposed to learn a wrapper that implements $S$,
i.e., the wrapper must allow for the extraction of all tuples in $S(D)$ for any
document $D$.

4.2 Island Wrappers 

Island wrappers generalize the structure of the wrapper that has been designed 
to extract certain information from the HTML representation of the list of ref- 
erences (cf. Figure 2). 

Island wrappers are AEFSs that consist of several basic length-bounded
AEFSs describing anchor languages and of a couple of fixed top-level rules that
determine the interplay between these basic AEFSs (cf. Figure 3). To be more
precise, consider an island wrapper that is designed to extract $n$-tuples from a
given HTML document. For every extraction variable $V_i$, there have to be basic
AEFSs $S_{l_i}$ and $S_{r_i}$ that define the left and right anchor languages $L_{l_i}$ and $L_{r_i}$
via the unary predicates $p_{l_i}$ and $p_{r_i}$, i.e., $L(S_{l_i}, p_{l_i}) = L_{l_i}$ and $L(S_{r_i}, p_{r_i}) = L_{r_i}$.
As a rule, it is assumed that the sets of predicate symbols used in the basic
AEFSs $S_{l_i}$ and $S_{r_i}$ are mutually disjoint. Now, intuitively, a string $s_i$
can be substituted for the extraction variable $V_i$ only in case that the actual
document $D$ contains a substring $u_i \circ s_i \circ w_i$ that meets $u_i \in L_{l_i}$ and $w_i \in L_{r_i}$.
Thus, the anchor languages put constraints on the surroundings in which the
relevant strings $s_i$ are embedded in $D$. Moreover, as argued at the beginning
of the last subsection, further minimality constraints are necessary. The strings
substituted for the extraction variables must be as short as possible, while the
distance between them has to be as small as possible. The top-level rules needed
are depicted in Figure 3. Let $w$, c-$p_{r_1}$, nc-$p_{r_1}$, c-$p_{l_2}$, nc-$p_{l_2}$, ..., c-$p_{l_n}$, nc-$p_{l_n}$,
c-$p_{r_n}$, and nc-$p_{r_n}$ be predicates not occurring in any of the basic AEFSs $S_{l_i}$ and $S_{r_i}$.

 1  w(V1, V2, ..., Vn, X1 L1 V1 R1 X2 L2 V2 R2 ... Xn Ln Vn Rn Xn+1) ←
        p_l1(L1), nc-p_r1(V1), p_r1(R1),
        nc-p_l2(X2), p_l2(L2), nc-p_r2(V2), p_r2(R2),
        ...
        nc-p_ln(Xn), p_ln(Ln), nc-p_rn(Vn), p_rn(Rn).

 2  nc-p_r1(X) ← not c-p_r1(X).

 3  c-p_r1(X) ← p_r1(X).    4  c-p_r1(XY) ← c-p_r1(X).    5  c-p_r1(XY) ← c-p_r1(Y).

    ...

 6  nc-p_rn(X) ← not c-p_rn(X).

 7  c-p_rn(X) ← p_rn(X).    8  c-p_rn(XY) ← c-p_rn(X).    9  c-p_rn(XY) ← c-p_rn(Y).

Figure 3. 

Formally speaking, an island wrapper is an AEFS $S_{iw} = (\Sigma, \Pi, \Gamma)$, where $\Pi$
is the collection of all predicates that occur either in rules in Figure 3 or in rules
belonging to the basic AEFSs $S_{l_i}$ and $S_{r_i}$, and $\Gamma$ contains all rules in Figure 3
and all rules in the basic AEFSs $S_{l_i}$ and $S_{r_i}$. As a matter of fact, note that every
$S_{iw}$ is length-bounded. Hence, by Theorem 5, $Sem(S_{iw})$ is decidable. The latter
makes island wrappers particularly well suited for IE.

In order to use an island wrapper $S_{iw}$ to extract information from a
document $D$, all the $n$-tuples $(s_1, \ldots, s_n)$ that meet $w(s_1, \ldots, s_n, D) \in Sem(S_{iw})$
have to be computed. Since all the $s_i$ are substrings of $D$, this can be done
effectively. An island wrapper $S_{iw}$ defines a particular view of the document
$D \in \Sigma^+$, namely $view(S_{iw}, D) = \{(s_1, \ldots, s_n) \mid w(s_1, \ldots, s_n, D) \in Sem(S_{iw})\}$.
Furthermore, the island wrapper $S_{iw}$ implements a particular semantics $S$ iff
$view(S_{iw}, D) = S(D)$.

Having the non-learnability results from Subsection 3.4 in mind, it is
unrealistic to assume that the class of all island wrappers is learnable from positive
examples. For a better understanding of the principal power and limitations of
the learning approach under consideration, we take a closer look at island
wrappers by putting some natural constraints on the admissible anchor
languages. By $\mathcal{AE}_{iw}$ we denote the collection of all length-bounded island
wrappers. For all $k \in \mathbb{N}$, $\mathcal{AE}_{iw}^k$ is the set of all island wrappers in $\mathcal{AE}_{iw}$ that are built
upon anchor languages $L$ with $card(L) \le k$.



4.3 Learning Island Wrappers from Marked Text 

Now, we are ready to study the question under which assumptions island
wrappers can be learned from positive examples. As defined above, island wrappers
differ in their anchor languages only. Hence, the overall learning task reduces to
the problem of finding AEFSs which describe the relevant anchor languages.

However, a potential user does not provide elements of the anchor languages 
to the system. Instead, the user marks interesting information in the HTML doc- 
uments under inspection. To illustrate this, consider again the semi-structured 
representation of the list of references. In Figure 4, the information the user is 
interested in is underlined. 

<html><body><h1>References</h1>

<ol><li>D. Angluin, ’Inductive inference of formal languages from positive data’,
<i>Information and Control</i>, <b>45</b>, 117-135, (1980).</li>

<li>D. Angluin and C.H. Smith, ’A survey of inductive inference: Theory
and methods’, <i>Computing Surveys</i>, <b>15</b>, 237-269, (1983).</li>

<li>S. Arikawa, ...

Figure 4. 

Clearly, the marked document provides only implicit information concerning
the anchor languages to be learned. For instance, it can easily be deduced that the
left anchor language contains a string that forms a suffix of the text segment
<html><body><h1>References</h1><ol><li> and a (possibly) different string that is
a suffix of the text segment .</li><li>. Moreover, the right anchor language
must contain at least one string that is a prefix of the text segment .</li><li>.
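
How such implicit information can be read off a marked document may be sketched as follows. The function `mark_contexts`, the helper `span`, and the toy document are illustrative assumptions, not part of the formal framework. Given a document and the spans of the marked strings, the function returns, for each mark, the text segment to its left (back to the previous mark or the document start) and to its right (up to the next mark or the document end). Suffixes of a left segment are then candidates for a left anchor, and prefixes of a right segment are candidates for a right anchor.

```python
def mark_contexts(doc, marks):
    """Return (left_segment, right_segment) for every marked span.

    doc:   the document as a string.
    marks: list of (start, end) index pairs, sorted and non-overlapping.
    """
    contexts = []
    for k, (start, end) in enumerate(marks):
        prev_end = marks[k - 1][1] if k > 0 else 0
        next_start = marks[k + 1][0] if k + 1 < len(marks) else len(doc)
        contexts.append((doc[prev_end:start], doc[end:next_start]))
    return contexts


def span(doc, text):
    """Span of the first occurrence of text in doc (helper for the toy example)."""
    start = doc.index(text)
    return (start, start + len(text))


DOC = "<ol><li>Ann, 'T', (1980).</li><li>Bob, 'U', (1983).</li>"
MARKS = [span(DOC, s) for s in ("Ann", "80", "Bob", "83")]
CONTEXTS = mark_contexts(DOC, MARKS)
for left, right in CONTEXTS:
    print(repr(left), repr(right))
```

In this toy document, the segment between the first year and the second author is shared: it bounds the right anchor of the year from one side and the left anchor of the author from the other, which is why the learner cannot simply be handed a text for either anchor language.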

To be precise, the learner does not receive a text for any of the relevant
anchor languages as input. Therefore, the results from Subsection 3.4 translate
only indirectly, after an appropriate adaptation, to our setting of learning
wrappers from marked texts.

Let $S$ be a semantics under some fixed interpretation $I$. A marked text $t$ for
$S$ is any sequence of pairs $(D, P)$ fulfilling the following conditions:

(1) Each $D$ is a document from $\Sigma^+$.

(2) $P$ is a finite set of pairs $(s, p)$, where $s = (s_1, \ldots, s_n)$ is an $n$-tuple in $S(D)$
and $p = (p_1, \ldots, p_n)$ describes where the $s_i$ begin in $D$, i.e., $p = I(s, D)$.

(3) The sequence is exhaustive, i.e., for every document $D \in \Sigma^+$, every $n$-tuple
in $S(D)$ eventually appears in $t$.

Now, a wrapper learner (henceforth called WIM) receives as input larger and 
larger initial segments of a marked text for some target semantics and generates 
as outputs AEFSs describing wrappers. A WIM is said to learn a semantics if 
the sequence of its outputs stabilizes on a wrapper which implements the target 
semantics. Finally, a WIM learns a class C of wrappers if it learns every semantics 
that is implementable by a wrapper in C. 

Our first result points out the general limitations of wrapper induction. 

Theorem 10 The class A£iw is not learnable from marked text.

Proof. Consider the following quite simple collection of island wrappers for
extracting 2-tuples. Let $\Sigma = \{a, b\}$, let $\Gamma$ be the set of rules depicted in Figure 3
for the case of n = 2, and let $\Pi$ be the set of all predicates used in $\Gamma$. Now,
we set Pi = {p!i(a)^}, P3 = {Ph{a)<—}, and P4 = {prsCa)^}. Additionally, 
for every j G IN, we set P® = {pn (a)<~) Pri (A)} as well as P|^^ = 

Now, for every j G IN, we let EFS Sj = [S, P, P U Pi U P| U P3 U P4). By 
definition, all island wrappers Sj do only differ in the left anchor language of 
the first extraction variable Ti. 

Finally, we claim that the collection $\{S_j \mid j \in \mathbb{N}\}$ is not learnable from
marked text. This can be shown by applying arguments similar to those used
in [6] for proving that superfinite language classes are not learnable from ordinary
text. We omit the details. □

In contrast, if there is a uniform bound on the cardinality of the relevant 
anchor languages, learning becomes possible. 

Theorem 11 For all k > 1, the class Af is learnable from marked text. 

Proof. Let $S_{iw}$ be an island wrapper from this class that extracts n-tuples, and let
$t = (D_1, P_1), (D_2, P_2), \dots$ be a marked text for $S_{iw}$ under some fixed interpretation.
We claim that the following WIM M learns $S_{iw}$ when successively fed t.

When learning an island wrapper from marked text, one may proceed as follows:
In a first step, decompose the overall learning problem into several problems
of learning anchor languages from ordinary text. In a second step, solve the
derived individual learning problems independently and in parallel. In a concluding
step, combine the solutions of the individual problems into a solution for the
overall learning problem.

There are three different types of individual learning problems to attack.
One problem consists in learning the left anchor language of the first extraction
variable $V_1$ (type A), another one in learning the right anchor language of the
last variable $V_n$ (type C). Moreover, there are $n - 1$ learning problems of type B,
each of which consists in simultaneously learning the right anchor language of a
variable $V_i$ and the left anchor language of the variable $V_{i+1}$ ($1 \le i \le n - 1$).

WIM M: On input $(D_1, P_1), \dots, (D_m, P_m)$ do the following.

Set $t^0, \dots, t^n$ to be the empty sequences.

For each $j = 1, \dots, m$, do the following:

For each pair $((s_1, \dots, s_n), (p_1, \dots, p_n))$ in $P_j$, compute (identified by the
given starting positions $p_1, \dots, p_n$ of $s_1, \dots, s_n$) the uniquely determined
substrings $w_0, w_1, \dots, w_n$ of $D_j$ such that $D_j = w_0 s_1 w_1 s_2 \dots s_n w_n$.

Append $w_0$ to $t^0$, $w_1$ to $t^1$, ..., and $w_n$ to $t^n$.

Let $\Gamma$ be the rules depicted in Figure 3 and $\Pi$ be the set of all predicates
used to formulate the rules in $\Gamma$. Do in parallel:

- On input $t^0$, run $M_A$, an IIM for problems of type A. Fix $\Gamma' = M_A(t^0)$.
  $\Gamma_0$ is built by replacing, everywhere in $\Gamma'$, the predicate p by $p_{l_1}$.

- For $i = 1, \dots, n - 1$, compute in parallel:
  On input $t^i$, run $M_B$, an IIM for problems of type B. Fix $\Gamma' = M_B(t^i)$.
  $\Gamma_i$ is built by replacing, everywhere in $\Gamma'$, the predicates p and q by $p_{r_i}$
  and $p_{l_{i+1}}$, respectively.

- On input $t^n$, run $M_C$, an IIM for problems of type C. Fix $\Gamma' = M_C(t^n)$.
  $\Gamma_n$ is built by replacing, everywhere in $\Gamma'$, the predicate p by $p_{r_n}$.

Output the EFS $(\Sigma, \Pi, \Gamma')$ with $\Gamma' = \Gamma \cup \Gamma_0 \cup \Gamma_1 \cup \dots \cup \Gamma_n$.
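The decomposition step of M, which cuts a marked document into the contexts surrounding the marked strings, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_contexts`, the use of 0-based starting positions, and the sample document are hypothetical and not part of the formal definition.

```python
def split_contexts(doc, strings, positions):
    """Return [w_0, ..., w_n] such that doc = w_0 s_1 w_1 ... s_n w_n.

    `strings` holds the marked strings s_1..s_n and `positions` their
    starting offsets in `doc` (0-based here, which is an assumption).
    """
    contexts = []
    cursor = 0  # first character of doc not yet consumed
    for s, p in zip(strings, positions):
        # The marking must be consistent with the document.
        if p < cursor or doc[p:p + len(s)] != s:
            raise ValueError("marking does not match the document")
        contexts.append(doc[cursor:p])  # text between two marked strings
        cursor = p + len(s)
    contexts.append(doc[cursor:])       # trailing context w_n
    return contexts


if __name__ == "__main__":
    doc = "<li>A, <i>T</i></li>"
    print(split_contexts(doc, ["A", "T"], [4, 10]))
```

Each returned context would then be appended to the corresponding sequence, so that the sub-learners of types A, B, and C receive ordinary texts.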

The IIM $M_A$ for learning problems of type A is defined as follows.

IIM $M_A$: On input $S = u_0, \dots, u_k$ do the following:

Set $\Gamma' = \emptyset$. Determine the set E of all non-empty suffixes of strings in S.
For all strings $e \in E$ check whether or not, for all o G P, w = o o e for some
u ^ S. Let T be the set of all strings e passing this test. Goto (a1).

(a1) If $T = \emptyset$, output $\Gamma'$. Otherwise, goto (a2).

(a2) Determine a shortest string e in T. Set $\Gamma' = \Gamma' \cup \{p(e) \leftarrow\}$ and
$T = T \setminus T_e$, where $T_e$ contains all strings in T with the suffix e. Goto (a1).

The IIM $M_C$ can be obtained from $M_A$ by replacing everywhere the term
suffix by prefix. It is not hard to see that $M_A$ and $M_C$ behave as required.

It remains to define an IIM for learning problems of type B. In the definition
of $M_B$, we let $\Gamma'' = \{\, nc\text{-}q(X) \leftarrow \mathrm{not}\ c\text{-}q(X),\ c\text{-}q(X) \leftarrow q(X),\ c\text{-}q(XY) \leftarrow c\text{-}q(X),\ c\text{-}q(XY) \leftarrow c\text{-}q(Y),\ r(XYZ) \leftarrow p(X), nc\text{-}q(Y), q(Z) \,\}$
and $\Pi'' = \{p, q, nc\text{-}q, c\text{-}q, r\}$.

IIM $M_B$: On input $S = u_0, \dots, u_k$ do the following:

Let B and E be the sets of all non-empty prefixes and suffixes of strings in S, respectively.
Moreover, let $B' = \{p(b) \leftarrow \mid b \in B\}$ and $E' = \{q(e) \leftarrow \mid e \in E\}$.

Let H be the collection of all sets $h \subseteq B' \cup E'$ such that h contains at most
k rules from $B'$ and at most k rules from $E'$.

Search for an $h \in H$ such that, for every $u \in S$, $r(u) \in Sem(\mathcal{S})$ holds,
where $\mathcal{S}$ is the EFS $(\Sigma, \Pi'', \Gamma'' \cup h)$. If such an h is found, let $\Gamma'$ be the
lexicographically first of them. Otherwise, set $\Gamma' = \emptyset$. Output $\Gamma'$.

The verification of $M_B$'s correctness is a little more involved. Due to
space constraints, the details have to be omitted. □
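The membership test that $M_B$ applies to each candidate hypothesis can be illustrated with a small Python sketch. It checks whether a string u can be split as u = xyz with x among the candidate prefix anchors (predicate p), z among the candidate suffix anchors (predicate q), and y containing no string from the q-anchors. The function names are hypothetical, and the requirement that x, y, and z are all non-empty is an assumption made here, based on the usual convention that EFS variables stand for non-empty strings.

```python
def contains_any(text, anchors):
    """True if some anchor occurs as a substring of text (predicate c-q)."""
    return any(anchor in text for anchor in anchors)


def derives_r(u, prefix_anchors, suffix_anchors):
    """Check r(u): u = x y z with p(x), nc-q(y), q(z), all parts non-empty."""
    for x in prefix_anchors:
        if not u.startswith(x):
            continue
        for z in suffix_anchors:
            if not u.endswith(z):
                continue
            # The middle part y must be non-empty (assumption, see lead-in).
            if len(u) - len(x) - len(z) < 1:
                continue
            y = u[len(x):len(u) - len(z)]
            if not contains_any(y, suffix_anchors):
                return True
    return False


if __name__ == "__main__":
    print(derives_r("</i>, <b>", {"</i>"}, {"<b>"}))
    print(derives_r("</i> <b> <b>", {"</i>"}, {"<b>"}))
```

A hypothesis h would be accepted only if this test succeeds for every string of the input sequence; enumerating the candidate sets h themselves is not shown.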



5 Conclusions 

The conventional concept of so-called Elementary Formal Systems has been
generalized to Advanced Elementary Formal Systems (AEFS), which have been
proven sufficiently expressive for representing certain wrappers of practical
relevance. So-called island wrappers are appropriate for representing classes of
HTML documents under a particular perspective. Island wrappers are characterized
as length-bounded AEFSs.

The user is directed to http://LExIKON.dfki.de, where illustrative examples
can be found which demonstrate the usefulness of the present approach.

Learnability of formal languages is known to be hard. Even island wrappers
are not automatically learnable from examples only. This sheds some light
on the difficulties of invoking learning techniques for information extraction from
the Internet.

The authors introduced additional constraints on the families of anchor languages
which are induced by island wrappers and investigated the impact of these
constraints on learnability. Anchor languages meeting such a constraint turn out
to be learnable from positive examples drawn from given semi-structured documents.
Thus, their corresponding island wrappers are learnable.

The results about representability and learnability above set the stage
for a specific approach towards automated information extraction from the
Internet. The basic scenario is as follows.

Given a sample Internet document, a user might have a particular view of
the document and of the information contained therein which is relevant to
him or her. By marking text passages in this document, the user specifies
this view in some detail. Marked text passages result in so-called marked text,
which is a formal concept within the underlying theoretical setting. If sufficiently
expressive examples have been marked, the intended view can be learned
automatically. This is done by inductive inference of island wrappers, which are
particular AEFSs. Any learned island wrapper not only allows information to be
extracted from the documents it has been synthesized upon, but also applies
to a potentially unlimited number of further documents not seen so far.

The results of the present paper justify scenarios of this type and prove
that information extraction through inductive learning does really work. Several
problems are left to future investigations. Among these are the question of
particularly efficient algorithms and the question of generalization to hierarchically
structured document sources.

Finally, note that the representation formalism developed is well suited to
describe other wrapper classes of practical relevance. For instance, the wrapper
classes from [8] can easily be described and their learnability can be studied
within this formal framework.

Abstract. This paper proposes a method of discovering Web communities.
A complete bipartite graph of Web pages can be regarded as a
community sharing a common interest. Discovery of such communities is
expected to assist users’ information retrieval from the Web. The method
proposed in this paper is based on the assumption that hyperlinks to related
Web pages often co-occur. Relations of Web pages are detected by
the co-occurrence of hyperlinks on the pages which are acquired from a
search engine by backlink search. In order to find a new member of a
Web community, all the hyperlinks contained in the acquired pages are
extracted. Then a page which is pointed to by the most frequent hyperlink
is regarded as a new member of the community. We have built a system
which discovers complete bipartite graphs based on the method. From
only a few URLs of initial community members, the system succeeds
in discovering several genres of Web communities without analyzing the
contents of Web pages.



1 Introduction 

According to an announcement by Inktomi and NEC Research Institute, the
number of Web pages in the world surpassed one billion documents as of January
2000 [8]. In order to find useful pages in such a vast Web network, many
users rely on search engines such as AltaVista or Yahoo. However, they often
have difficulty because of ubiquitous synonymy (different words having the same
meaning) and polysemy (the same word having multiple meanings). A system
which discovers related Web pages is expected to assist users’ information
retrieval from the Web.

In general, Web pages contain various forms of information such as sentences,
images, and sounds. Understanding all such contents and classifying the pages
appropriately are not easy tasks even for humans. A large number of studies have
been made on finding relations of Web pages based on linguistic information.
Broder defines document similarity as the ratio of common subsequences of
words [3]. Chang employs TFIDF as the criterion for document similarity [6].
Although these methods are widely applicable to ordinary documents, utilizing
hyperlinks, which are information peculiar to Web pages, is expected to help
greatly in the accurate classification of Web pages.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 65-75, 2000.

© Springer-Verlag Berlin Heidelberg 2000

This paper proposes a method of discovering communities of Web pages based
on the co-occurrence of references. A system for discovering Web communities is
developed based on the method. The input to the system is a few URLs of Web
pages about a specific topic such as movies or sports. The output of the system is
the communities of Web pages sharing common interests with the input pages.

In this paper, a community is defined as a set of Web pages whose hyperlinks
form a complete bipartite graph, as mentioned in Kumar’s paper on Web
Trawling [13]. In a $K_{i,j}$ graph, each of the i pages contains hyperlinks to all of the j
pages. We call the former i pages fans and the latter j pages centers in
this paper. The procedure for community discovery is as follows: first, our system
acquires fans which have hyperlinks to all the input centers by backlink search on
a search engine. From the HTML files of the acquired fans, all the hyperlinks are
extracted. Then a page which is pointed to by the most frequent hyperlink is added
as a new member of centers. By repeating these two steps, a Web community is
searched for without analyzing the contents of Web pages. Since the method utilizes
backlink information, implicit relations between pages which have no direct
hyperlinks to each other can be detected. The system succeeds in discovering
several genres of communities from only a few input URLs.



2 Related Work 

Hyperlinks often give authority to the contents they point to. Several attempts
have been made at link analysis [10] and Web visualization [9][14] since hyperlinks
are expected to give important clues for finding relations among Web pages.
However, this assumption is not always true, for the following reasons. First,
hyperlinks between two pages in the same Web site very often serve a purely
navigational function and typically do not represent conferral of authority [12].
Second, related Web sites frequently do not reference one another because they
are rivals, they are on opposite sides of thorny social issues, or they are simply
not aware of each other’s presence [13]. In order to find relations among Web
pages, it is not enough to investigate hyperlinks from the pages; hyperlinks to
the pages, which we call backlinks, are often more important. Although it is not
possible to search all the backlinks that point to an arbitrary Web page, many
of them can be obtained by backlink search on a search engine.

The author has developed a Web visualization system [15] based on the
co-occurrence of references in a search engine, and its online demonstration is
available at http://www.cs.gunma-u.ac.jp/~tmurata/. The input to the system
is a few URLs of arbitrary Web pages, and the output of the system
is a graph which shows the relation of the URLs. Figure 1 shows the relations
of the following ten URLs: www.sprint.com, www.mci.com, www.att.com,
www.ibm.com, www.compaq.com, www.dell.com, cnn.com, www.usatoday.com,
www.nytimes.com, and www.microsoft.com. You can find four clusters in this
figure: telecommunication companies, computer companies, news companies, and
Microsoft.

Fig. 1. Visualization based on the co-occurrence of references

In order to measure the degree of relation between two URLs, this system
performs a search on AltaVista by using the URLs as keywords. The number of
pages found for both URLs divided by the number of pages found for either URL,
which is called the Jaccard coefficient, is calculated for
evaluating the relation between the two URLs. This is because hyperlinks to
related Web pages often co-occur in a Web page. The length of the edge connecting
two URLs is defined as the reciprocal of the Jaccard coefficient of the two URLs so
that related Web pages are located close to each other on a graph. This method
of measuring the degree of relation based on the number of co-referring pages is
similar to the technique of REFERRAL [11], which visualizes researchers’ social
networks.
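The coefficient and the resulting edge length can be illustrated with a short Python sketch. It is a toy version under stated assumptions: the sets of referring pages are given directly as Python sets (the system itself would obtain page counts from a search engine), the function names are hypothetical, and treating a zero coefficient as an infinitely long edge is a choice made here.

```python
def jaccard(pages_a, pages_b):
    """Pages referring to both URLs divided by pages referring to either."""
    union = pages_a | pages_b
    if not union:
        return 0.0
    return len(pages_a & pages_b) / len(union)


def edge_length(pages_a, pages_b):
    """Reciprocal of the Jaccard coefficient; unrelated URLs get infinity."""
    coefficient = jaccard(pages_a, pages_b)
    return float("inf") if coefficient == 0 else 1.0 / coefficient


if __name__ == "__main__":
    refs_a = {"page1", "page2", "page3"}   # pages that mention URL A
    refs_b = {"page2", "page3", "page4"}   # pages that mention URL B
    print(jaccard(refs_a, refs_b), edge_length(refs_a, refs_b))
```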

For the discovery of relations or ranks of Web pages based on the structure
of hyperlinks, several studies have been conducted, such as Kleinberg’s Clever
project [7], Kumar’s Web Trawling [13], and Page’s PageRank algorithm [16]. The HITS
algorithm [12], which is one of the central ideas of the Clever project, employs
authority and hub as the criteria for measuring the usefulness of Web pages. For
any particular topic, there tend to be a set of authoritative pages focused on
the topic, and a set of hub pages which contain links to useful pages relevant to
the topic. The algorithm associates an authority weight and a hub weight with
each Web page. If a page is pointed to by many good hubs, its authority weight
should be increased. The authority weight of a page is updated to be the sum of hub
weights over all pages that link to the page. In the same manner, the hub weight
of a page is updated to be the sum of authority weights over all pages that are
linked from the page. The HITS algorithm iterates the calculation of both weights and
outputs a list of the pages with the largest hub weights and the largest authority
weights. However, the HITS algorithm needs to assemble a different root set for each
target topic, and then to prioritize pages in the context of that particular topic.
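The two update rules described above can be sketched in Python on a small in-memory link graph. This is a minimal illustration, not the Clever implementation: the function name `hits`, the fixed number of iterations, and the normalization of the weights after every round are assumptions added here to keep the numbers bounded.

```python
def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub weight: sum of authority weights of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalization (an assumption of this sketch).
        for weights in (auth, hub):
            norm = sum(v * v for v in weights.values()) ** 0.5 or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub


if __name__ == "__main__":
    toy_links = {"f1": ["c1", "c2"], "f2": ["c1"]}
    authority, hubs = hits(toy_links)
    print(sorted(authority, key=authority.get, reverse=True))
```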

Kumar’s Web Trawling is a method for discovering communities from the
graph structure of Web pages. For example, Web pages of aircraft enthusiasts
have hyperlinks to all the homepages of major commercial aircraft manufacturers
such as Boeing and Airbus. These pages and hyperlinks compose a complete
bipartite graph $K_{i,j}$ since each of the i pages contains hyperlinks directed to all of the
j pages. Kumar regards such a graph in the Web network as a cyber-community
sharing a common interest. By searching for such graphs in snapshot data of
the Web network, more than a hundred thousand communities were discovered.
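The defining condition of such a community is easy to state in code. The following Python sketch checks it for a toy link table; the function name and the sample page names are hypothetical.

```python
def is_complete_bipartite(fans, centers, links):
    """True if every fan has hyperlinks to all of the centers.

    `links` maps a page to the collection of pages it links to.
    """
    required = set(centers)
    return all(required <= set(links.get(fan, ())) for fan in fans)


if __name__ == "__main__":
    toy_links = {
        "enthusiast1": ["boeing", "airbus", "other"],
        "enthusiast2": ["boeing", "airbus"],
        "enthusiast3": ["boeing"],
    }
    print(is_complete_bipartite(["enthusiast1", "enthusiast2"],
                                ["boeing", "airbus"], toy_links))
```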

Page’s PageRank algorithm calculates the rank of each Web page by propagating
the probability distribution of user visits. For each page, its probability
is calculated as the sum of the probabilities of the pages that link to the page. The
propagation of probabilities is iterated until they converge. The algorithm
efficiently computes ranks for 518 million hyperlinks, and it is employed in the Google
search engine (http://www.google.com/).

These three approaches make use of hyperlinks as clues for detecting relations
or ranks of Web pages. It is true that these three approaches are effective, but
they require a large-scale database of HTML files. Since the number of Web
pages in the world is increasing rapidly, building such a database that covers
most of the Web network and renewing it are not simple tasks.

Our discovery system described in this paper acquires data from other Web
servers during the process of discovery in order to use new, abundant data. In
addition, the system acquires backlink information from a search engine.
The output of the system is a list of URLs which share common interests with
the input URLs. It often happens that a user is already familiar with some Web
pages on a specific topic and needs to find more pages about the topic. If a discovery
system outputs a set of related pages which contains the user’s familiar pages,
the result is easily accepted by the user. Discovery of a set of pages which are
related to given pages is important for achieving Web recommendations that are
convincing for a user.



3 A Method of Discovering Web Communities 



The goal of our method is to discover a Web community sharing a common
interest. The initial members of a community are a few URLs that a user has provided.
The overall discovery procedure consists of the following three steps. Each of
these steps is explained in the following subsections.

1. Search of fans using a search engine

2. Addition of a new URL to centers

3. Sort of centers in the order of frequency

As mentioned before, a complete bipartite graph $K_{i,j}$, which is the target
graph of our discovery method, is composed of a set of i pages and a set of j
pages: each of the i pages has hyperlinks pointing to all of the j pages. In the
following explanation, fans refer to the set of i pages and centers refer to the set
of j pages.

3.1 Search of fans using a search engine

In our method, input URLs are accepted as initial centers, and fans which
co-refer to all of the centers are searched for. As shown in Figure 2, fans are searched
for from the centers by backlink search on a search engine. In general, popular
centers have too many backlinks. In such cases, a fixed number of high-ranking
URLs are selected as fans. Most search engines rank pages according to
their relevance to input keywords. However, there is not much public information
about the specific ranking algorithms used by current search engines [1]. In our
method, high-ranking URLs are selected just as a matter of convenience. Since
the WWW is changing rapidly every day, acquisition of new data is indispensable
for the discovery of current (not outdated) communities. By backlink search
on a search engine, relatively new data can be acquired through the internet.



[Figure omitted: fans and centers, with the result of the search]

Fig. 2. Search of fans using a search engine



3.2 Addition of a new URL to centers 

The next step is to add a new URL to centers based on the hyperlinks of the
acquired fans. The fans’ HTML files are acquired through the internet, and all the
hyperlinks contained in the files are extracted. The hyperlinks are sorted in the
order of frequency. Since hyperlinks to related Web pages often co-occur, the
top-ranking hyperlink is expected to point to a page whose contents are closely
related to the centers. As shown in Figure 3, the URL of the page is added as a
new member of centers.

The above two steps are repeatedly applied in order to acquire many centers.
In general, the number of fans decreases as the number of centers
increases, since fewer Web pages contain hyperlinks to all of a larger set of centers.
The above two steps are repeatedly applied until there are few fans which refer to all
the members of centers.
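The two steps above can be sketched in Python on an in-memory link table that stands in for both the search engine and the fetched HTML files. Everything specific in this sketch is an assumption: the function names, the thresholds `min_fans`, `max_fans`, and `max_centers`, the alphabetical tie-breaking, and the toy data. The real system obtains fans by backlink search and selects high-ranking results.

```python
from collections import Counter


def find_fans(centers, links, max_fans=50):
    """Pages (other than the centers) that link to every current center."""
    fans = [page for page, targets in sorted(links.items())
            if page not in centers and centers <= set(targets)]
    return fans[:max_fans]


def grow_centers(initial_centers, links, min_fans=2, max_centers=20):
    """Repeat: search fans, then add the most frequently linked URL."""
    centers = set(initial_centers)
    order = list(initial_centers)
    while len(centers) < max_centers:
        fans = find_fans(centers, links)
        if len(fans) < min_fans:          # too few fans left: stop
            break
        counts = Counter(target for fan in fans
                         for target in set(links[fan])
                         if target not in centers)
        if not counts:
            break
        # Most frequent hyperlink; ties are broken alphabetically.
        best = min(counts, key=lambda url: (-counts[url], url))
        centers.add(best)
        order.append(best)
    return order


if __name__ == "__main__":
    toy_links = {
        "f1": ["a", "b", "c"],
        "f2": ["a", "b", "c", "d"],
        "f3": ["a", "b", "d"],
        "f4": ["a", "x"],
    }
    print(grow_centers(["a", "b"], toy_links))
```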



[Figure omitted: fans and centers]

Fig. 3. Addition of a new URL to centers



3.3 Sort of centers in the order of frequency 

Based on the above two steps, a community of related Web pages is generated.
However, since our method is based solely on the co-occurrence of hyperlinks,
the genres of newly added centers might be different from those of the input URLs. For
example, a community generated from the URLs of personal computers may
contain the URLs of video games. It is not easy to detect such a change of genres
and to find the true boundary of a community. If such a boundary cannot be found, it
is desirable for a user to have the URLs of communities ranked in the order of relevance
to the input URLs. In order to achieve such a ranking, communities are generated
many times: our method generates communities for every pair of input URLs.
For example, if five URLs are provided, ten (= ${}_5C_2$) communities are generated
from the pairs of 1st & 2nd URLs, 1st & 3rd URLs, ..., and 4th & 5th URLs.
Then the centers of all the generated communities are sorted in the order of
frequency. The sorted result is expected to reflect the rank of relevance to the input
URLs because highly ranked URLs co-occur many times with the input URLs.
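The pairwise generation and the frequency sort can be sketched in Python as follows. The callable `discover` is a hypothetical stand-in for the community generation of Sections 3.1 and 3.2; here it is replaced by a lookup in a toy table, and ties are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations


def rank_centers(input_urls, discover):
    """Generate a community for every pair of inputs and sort all centers
    by the number of communities in which they appear."""
    counts = Counter()
    for pair in combinations(input_urls, 2):
        counts.update(set(discover(pair)))   # count each center once per pair
    return sorted(counts, key=lambda url: (-counts[url], url))


if __name__ == "__main__":
    toy_communities = {
        ("a", "b"): ["a", "b", "x"],
        ("a", "c"): ["a", "c", "x", "y"],
        ("b", "c"): ["b", "c", "y", "x"],
    }
    print(rank_centers(["a", "b", "c"], lambda pair: toy_communities[pair]))
    # Five inputs give ten pairs.
    print(len(list(combinations(range(5), 2))))
```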

4 Experiments 

Based on the method described above, a system for discovering Web communities
was developed using Java. It is desirable that input URLs are popular ones
that many others refer to. As the input to the system, URLs from the 100hot.com site
(http://www.100hot.com/) are used in our experiments. 100hot.com is a collection
of a hundred famous Web pages for each of 74 genres, such as art, movies, travel,
and so on. The site is administered by Go2Net, and its ranking is based on the
Web surfing patterns of more than 100,000 surfers worldwide. Some genres in the
site are duplicated (such as “Entertainment/Book Sites” and “Shopping/Book
Sites”), and some are sponsored by companies concerned (for example, “DVD
Best Sellers” is sponsored by Amazon.com). These genres are excluded from our
experiments. In order to discover communities for the remaining 33 genres, the top
five URLs of each genre are provided to the system as inputs. Our system generates
a community for every pair of the input URLs and outputs the centers of
all the communities sorted in the order of frequency.

In order to evaluate the quality of the system’s output URLs, the ranking
of 100hot.com is regarded as the collection of “correct answers”. In other
words, if a URL is listed in the 100hot.com ranking of the corresponding genre,
it is regarded as a “correct answer”; otherwise it is regarded as an “incorrect
answer”. Since there are many relevant URLs which are not listed on the 100hot.com
site, this evaluation criterion is rather too severe for the system. However, we
adopt this criterion nonetheless since it clarifies the power of our system. As the search
engine for backlink search, AltaVista (http://www.altavista.com) is used. The
results of the experiments are shown in Table 1. The first column of the table
shows genres. The second and third columns of the table (total, correct) show
the total number of acquired centers for the corresponding genre and the number of
“correct answers” among them, respectively. The fourth to seventh columns (1Q,
2Q, 3Q, 4Q) show the number of “correct answers” in each quarter of the ordered
list of output URLs. For example, “1Q” shows the number of “correct answers”
which are located in the first quarter of the list of output URLs.
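The per-quarter counts can be computed as in the following Python sketch. How the original experiments split a list whose length is not divisible by four is not stated, so the index-based split used here is an assumption, as are the function name and the sample data.

```python
def correct_per_quarter(ranked_urls, correct_urls):
    """Count the correct answers in each quarter (1Q..4Q) of a ranked list."""
    n = len(ranked_urls)
    counts = [0, 0, 0, 0]
    for index, url in enumerate(ranked_urls):
        if url in correct_urls:
            quarter = min(3, index * 4 // n)   # 0 for 1Q, ..., 3 for 4Q
            counts[quarter] += 1
    return counts


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
    listed = {"a", "b", "e", "h"}              # hypothetical "correct answers"
    print(correct_per_quarter(ranked, listed))
```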

Table 1 shows that the system performs very well for many genres. The 
system discovers these related pages only from five input URLs. As a detailed 
example, the result of genre “Kids” is shown below. The following five URLs are 
given to the system as inputs: 

— www.pbs.org 

— www.headbone.com 

— www.bolt.com 

— www.yahooligans.com 

— www.discovery.com

Table 1. Results of the experiments

genre        total   correct    1Q     2Q     3Q     4Q
Art             45        2      2      0      0      0
Books          117       12     10      2      0      0
Events         172        6      4      0      0      2
Music          184       28      6     11      8      3
Finance        130       42     21     10      4      7
Jobs            73       27     15      6      6      0
Loans          142       12     10      0      0      2
College        196       47     20     13      7      7
Kids           172       42     30      3      9      0
Gambling        53       16      6      5      4      1
Movies          89        8      4      0      1      3
Games          137       36     17      8     10      1
Family          93        3      2      1      0      0
Food            94        7      4      1      0      2
Gardening      148        6      2      1      1      2
Pets           146       18      9      1      0      8
Cars            92       40     13     16      6      5
Chat            96        4      0      3      0      1
Dating          62       10      6      4      0      0
Spirits        171       21      4      4     13      0
Travel         124       38     12     10     16      0
Magazines       74       12      6      1      1      4
Newspapers     167       40     11     23      3      3
Auction        158       18      9      6      3      0
Flowers        141       15      8      1      0      6
Shopping       162       21      6      6      9      0
Health         130       14     10      1      1      2
Sports          95       26     14      7      4      1
Developer      123        9      4      2      3      0
Hardware       164       29     17      9      2      1
Mac OS         143       29     19      2      7      1
Unix            95        8      5      2      1      0
Windows        130        9      5      1      2      1
average      124.8     19.8    9.4    4.8    3.7    1.9



The top 10 output URLs are shown below:

1. www.cyberkids.com

2. www.ctw.org

3. www.exploratorium.edu

4. www.si.edu

5. www.bonus.com

6. www.kids-space.org

7. www.discovery.com

8. www.youruleschool.com

9. www.planetzoom.com

10. www.kidscom.com

All of these URLs except www.planetzoom.com (9th) are listed in the Kids
genre of 100hot.com. If you look at the contents of each URL, you will agree
that www.planetzoom.com is also a site for kids, although it is not listed in
100hot.com. These results show that the system is able to discover many
URLs which are related to the input URLs.

5 Discussion 

5.1 The quality of discovered communities 

As shown in Table 1, the number of “correct answers” varies considerably with
the genres of input URLs. This means that there is a great difference in the
quality of discovered communities. For example, communities for “Finance”,
“College”, “Kids”, “Cars”, and “Newspapers” are of better quality than those
of “Art”, “Family”, and “Chat”. There are various factors behind such differences in
quality: the number of Web pages belonging to a genre, the number of hyperlinks
contained in the pages of a genre, and the contents of the search engine used for
backlink search.

The method of discovery described in this paper is effective for Web
communities whose pages are densely connected by many hyperlinks. If the number
of centers is rather limited and many fans refer to most of them, our system
performs well. The differences in experimental results across genres are caused by
users’ browsing patterns and link patterns for each genre, and they do not indicate
a limitation of our method.

5.2 Sort of centers in the order of frequency 

As mentioned in Section 3.3, our system generates communities many times from
every pair of input URLs, and sorts acquired centers in the order of frequency.
Since it often happens that co-occurring hyperlinks point to pages of completely
different genres, acquired URLs should be sorted in order to minimize the bad
influence of accidental co-occurrences of hyperlinks. Table 1 shows that more
“correct answers” are found in the higher ranks of the output list, such as 1Q and
2Q. This means that the URLs which are closely related to the genre of the input
URLs are located in the higher ranks of the output list. Although our method is very
simple, it succeeds in ranking URLs in the order of relevance to a certain extent.

The list of URLs that our system discovers often contains some portal sites,
such as news sites and search engines. To keep such sites from being mixed
into discovered communities, anchor descriptions on hyperlinks are expected to
be useful. In addition to the co-occurrence of references proposed in this paper,
additional information about the contents of referring sites will contribute to the
discovery of purer communities.

5.3 Data acquisition from WWW servers through the internet

The system described in this paper acquires the needed data from WWW servers
and AltaVista in the process of discovery. Although the system accesses several
Web servers, its accesses do not concentrate on one server. The number of
accesses to each server is at most the number of centers of a Web community.
With the accesses through the internet, new, abundant data are available for our
discovery system.

Most of the time for community discovery is spent on the acquisition of
HTML files through the internet. In our system, only the data acquired within
a certain time limit are used for discovery. This time limit is necessary in order
to achieve discovery within practical time even when some WWW servers are
down or the network condition is really bad.



5.4 Comparison with related works 

In order to evaluate the quality of system’s output, previous researches such as 
HITS or Web Trawling perform experiments using human subjects, or just show 
the results in their papers. In this paper, the URLs listed in 100hot.com site are 
regarded as “correct answers”, and they are used for evaluation. Although this 
criterion is rather severe for the evaluation of related URLs, our system succeeds 
in discovering 19.8 “correct” URLs on average only from five input URLs. This 
result is hard to compare with other related researches, but it is surprising that 
our simple method succeeds in the discovery of so many related URLs. 

Our system is quite different from other Web discovery systems in that the needed 
data is acquired from WWW servers in the process of discovery. Systems such as HITS 
and Web Trawling require a considerable number of Web pages to be collected and 
provided in advance. Their performance depends heavily on the amount and the 
quality of the input data. In the experiments of Kumar’s Web Trawling, the 
snapshot data used are over a year and a half old. Therefore, some of the discovered 
communities have already disappeared from the current Web. Since many 
Web pages appear and disappear every day, large-scale Web snapshot 
data are hard to obtain and maintain. Our system acquires data in 
the process of discovery and discovers communities that actually exist in the 
current Web from only a few input URLs. 



6 Conclusion 

This paper describes a method of discovering communities of related Web pages 
based on the co-occurrence of references. Our community discovery system searches 
for a complete bipartite graph in the data acquired through the internet. The 
results show that our system is able to discover several genres of related 
Web pages from only a few input URLs. There are a number of areas for further 
work: 

Detecting dynamic changes of communities 

Backlink information acquired from a search engine may change as time 
elapses. Comparing the communities generated at regular time 
intervals will clarify dynamic changes of communities. 

Giving weights to hyperlinks based on their location 

Hyperlinks to related Web pages are often placed close to each other on a 
Web page. By giving weight to such hyperlinks, more accurate relation of 
Web pages will be discovered. 

Abstract. New, more effective software tools are needed for the analysis 
and organization of the continually growing biological databases. An 
extension of the Self-Organizing Map (SOM) is used in this work for 
the clustering of all the 77,977 protein sequences of the SWISS-PROT 
database, release 37. In this method, unlike in some previous ones, the 
data sequences are not converted into histogram vectors in order to per- 
form the clustering. Instead, a collection of true representative model 
sequences that approximate the contents of the database in a compact 
way is found automatically, based on the concept of the generalized me- 
dian of symbol strings, after the user has defined any proper similarity 
measure for the sequences such as Smith-Waterman, BLAST, or FASTA. 
The FASTA method is used in this work. The benefits of the SOM and 
also those of its extension are fast computation, approximate representa- 
tion of the large database by means of a much smaller, fixed number of 
model sequences, and an easy interpretation of the clustering by means 
of visualization. The complete sequence database is mapped onto a two- 
dimensional graphic SOM display, and clusters of similar sequences are 
then found and made visible by indicating the degree of similarity of the 
adjacent model sequences by shades of gray. 



1 Introduction 

The number of DNA sequences, protein sequences, and molecular structures studied 
and reported, e.g., on the Internet is already overwhelming. Better tools are 
needed for the analysis of the existing databases. Such tools will 
also make it possible to make new discoveries without the need to carry 
out real biological and chemical experiments. 

Among the new challenges one may mention finding the hidden relations 
between the data items, revealing structures in large databases, and 
presenting the results to humans in a comprehensible way. The classification and 
clustering of the sequences may reveal previously unknown connections between them. 
The visualization of large data sets in a compact way may give insights into the 
data and lead to the development of new ideas and theories. 

Although data mining applications in general require very specialized 
and tailored solutions, it is interesting to note that some general principles and 
methods can already define a framework for these tasks. One such method is the 
Self-Organizing Map (SOM) [11,13,14,16]. It is a clustering and visualization 
tool, which has been applied to a diversity of problems. This paper points out 
the potential of the SOM in the clustering and organization of large sequence 
databases. 

The SOM has already been applied to the clustering of protein sequences. 
In [6], the sequences were converted into 400-dimensional dipeptide histogram 
vectors. In [7], similar amino acids were grouped together before computing the 
histogram vectors. In [9], the sequences were first aligned and then converted 
into vectors by fractal encoding. In [2], each position of the 
sequence was represented as a 20-dimensional vector; each vector component 
corresponded to one amino acid. The whole sequence was then converted into 
an L-by-20-dimensional vector, where L is the length of the global alignment of 
all sequences. In all these works, then, the data were encoded as 
vectors before being fed to the SOM. 
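
To make the 400-dimensional encoding concrete, the following is a minimal sketch of a dipeptide histogram: one counter for each of the 20 x 20 ordered pairs of amino acids. The names `AMINO_ACIDS` and `dipeptide_histogram` are illustrative, and the choices made here (overlapping pairs, raw counts, skipping non-standard symbols) are assumptions; the cited works may differ in such details.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (the ordering is arbitrary).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_histogram(sequence):
    """Count overlapping adjacent residue pairs; returns a list of 400 counts."""
    hist = [0] * len(DIPEPTIDES)
    for a, b in zip(sequence, sequence[1:]):
        idx = INDEX.get(a + b)
        if idx is not None:  # assumption: pairs with non-standard symbols are skipped
            hist[idx] += 1
    return hist
```

A sequence of length n contributes at most n - 1 counts, so sequences of different lengths map to vectors of the same dimension, which is what makes them usable as inputs to a vectorial SOM.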

A new method suggested by Kohonen [15], however, allows the organization 
of nonvectorial data items, too. The clustering and organization of the 
sequence database can therefore be based on any user-defined algorithm, e.g. 
Smith-Waterman [20], BLAST [1], or FASTA [19]. In the present work, the 
FASTA method was used for computing the sequence similarities. The SOM was 
then applied to clustering all the 77,977 protein sequences of the SWISS-PROT 
database, release 37 [3]. 

2 The Self-Organizing Map for both vectorial and 
nonvectorial data 

In its original form the Self-Organizing Map is a nonlinear projection method 
that maps a high-dimensional metric vector space, or actually only the manifold 
in which the vectorial samples are located, onto a two-dimensional regular grid 
in an orderly fashion [11,14]. The SOM differs from the traditional projection 
methods such as multidimensional scaling, MDS [17] in that unlike in the latter, 
each original sample is not represented separately, but a much smaller set of 
model vectors, each of the latter associated with one of the grid nodes, is made 
to approximate the set of original samples. The SOM thus carries out a kind 
of vector quantization, VQ [8], in which, however, the model vectors (called 
codebook vectors in VQ) may be imagined to constitute the nodes of a flexible, 
smooth network that is fitted to the manifold of the samples. 

The SOM principle is not restricted to metric vector spaces, however. It has 
been pointed out by one of the authors [15] that any set of items, for which 
a similarity or distance measure between its elements can be defined, can be 
mapped on the SOM grid in an orderly fashion. This is made possible by the 
following principle, which combines the concept of the generalized median of a 
set [12] with the batch computation principle of the SOM [14]. 



Fig. 1. Illustration of the SOM algorithm for nonvectorial data. Each of the input 
items $x(1), x(2), \ldots$ is copied into the sublist under that model that has the smallest 
distance from the respective input item. After that, the generalized median $X_i$ in each 
neighborhood set $N_i$ is determined, and the old value, say $m_i$, is replaced by $X_i$. This 
cycle is repeated from the beginning until the models no longer change. 



Let us concentrate on the special SOM that is able to map nonvectorial 
items. Consider Fig. 1 in which a regular grid is shown, with some general model 
$m_c, \ldots, m_p$ associated with each grid node. Assume that a sublist that contains a 
subset of input items $x(i)$ can be associated with each model. Each of the input 
items $x(1), x(2), \ldots$ is compared with all the models and listed under that one 
that has the smallest distance from the respective input item. The $x(1), x(2), \ldots$ 
will thus be distributed under the closest models. 

Define for each model, say $m_i$, a neighborhood set $N_i$ (the set of models 
located within a certain radius from the node $i$ in the grid). Consider the union 
of all the sublists within $N_i$ (shown by the set line in Fig. 1) and try to find the 
“middlemost” input sample in $N_i$. This sample is called the generalized median 
of $N_i$, and it is defined to be identical with the input sample that has the smallest 
sum of distances from all the other samples of $N_i$. 

In forming the sum of distances, the contents of the sublists within $N_i$ can 
be weighted so that the weight is a function of the distance of the nodes of the 
grid from, say, node $i$. This corresponds to the neighborhood function used with 
the traditional SOM [14]. 
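
The unweighted generalized median defined above can be sketched in a few lines. The function names are illustrative, and `hamming` is only a toy stand-in for a real sequence distance; any user-supplied distance function can be passed in.

```python
def generalized_median(items, distance):
    """Return the item with the smallest sum of distances to all items."""
    # The distance of an item to itself is zero, so including it in the sum
    # does not change which item wins.
    return min(items, key=lambda a: sum(distance(a, b) for b in items))


def hamming(a, b):
    """Toy distance for equal-length strings (stand-in for a real measure)."""
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
```

For example, `generalized_median(["AAAA", "AAAT", "TTTT"], hamming)` selects `"AAAT"`, whose distance sum (1 + 3 = 4) is the smallest of the three. Note that the result is always one of the input items, and that this direct form needs a number of distance evaluations that is quadratic in the size of the list.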

Comment 1. If the input samples had been real scalars and the distance 
measure were the absolute value of their difference, it is easy to show that the 
“generalized median” coincides with the arithmetic median. 

Comment 2. If the input samples were real vectors, and the distance measure 
were Euclidean, and if the item with the smallest sum of the squares of 
distances from the other items were sought, the “generalized median” would 
coincide with the arithmetic mean of the union of the lists. In this case the 
“median” is not restricted to the input samples, but belongs to the same domain. 

For each $N_i$ in Fig. 1, $i = c, d, \ldots, p$, the generalized median is now 
determined, and the old models $m_c, \ldots, m_p$ are replaced by the respective generalized 
medians. 

After this replacement, the original models have now been changed, and if 
the same input samples are compared with them, they are now redistributed in 
a different way in the lists. Eventually, however, in a finite number of iterations 
of this type the process will converge, after which the models approximate the 
input samples in an orderly fashion. 
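
The cycle described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the grid is rectangular rather than hexagonal, the neighborhood set is a plain Euclidean disc on grid coordinates, no neighborhood weighting is applied, and the names `batch_cycle` and `train` are hypothetical.

```python
def batch_cycle(models, inputs, distance, grid_width, radius):
    """One cycle of the batch SOM for nonvectorial items on a rectangular grid."""
    n_nodes = len(models)
    # Step 1: list every input under its closest model.
    sublists = [[] for _ in range(n_nodes)]
    for x in inputs:
        winner = min(range(n_nodes), key=lambda i: distance(x, models[i]))
        sublists[winner].append(x)

    def coords(i):
        return divmod(i, grid_width)  # (row, column) of node i

    new_models = list(models)
    for i in range(n_nodes):
        ri, ci = coords(i)
        # Step 2: union of the sublists of all nodes within `radius` of node i.
        union = [x for j in range(n_nodes)
                 if (coords(j)[0] - ri) ** 2 + (coords(j)[1] - ci) ** 2 <= radius ** 2
                 for x in sublists[j]]
        # Step 3: replace the model by the generalized median of the union.
        if union:
            new_models[i] = min(union, key=lambda a: sum(distance(a, b) for b in union))
    return new_models


def train(models, inputs, distance, grid_width, radius, max_cycles=100):
    """Repeat batch cycles until the models stop changing (or max_cycles is hit)."""
    for _ in range(max_cycles):
        updated = batch_cycle(models, inputs, distance, grid_width, radius)
        if updated == models:
            break
        models = updated
    return models
```

The `max_cycles` cap is a practical safeguard added for this sketch: it bounds the run time in case the models keep changing.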

It is not yet mathematically proven that the above process converges, at 
least into a unique equilibrium. In practice, convergence means that the lists will 
not be changed any longer in further iterations. Furthermore, there may exist 
alternative states into which the map may converge. A convergence proof of a similar 
“batch map” process with vectorial items has been presented [4], but any conclusions for 
nonvectorial items can only be drawn from the experimental results, in which 
no problems have so far been encountered. 

Comment 3. As in the traditional SOM for vectorial items [14], the radius 
of the neighborhood set $N_i$ in the beginning of the process may be selected as 
fairly large and made to shrink monotonically in further iterations. The speed of 
shrinking should be determined experimentally so that the global ordering is 
achieved. 



3 Clustering of 77,977 protein sequences 

The SWISS-PROT database, release 37 (12/98) [3] consists of 77,977 protein 
sequences. The sequences contain altogether 28,268,293 amino acid residues. 
Organization of a database of this size, and representation of the result in a 
compact form is a challenging task. Our purpose was to use the definition of 
distances between the protein sequences, as made in the FASTA method [19] for 
the computation of the SOM as described in the previous section. A 30-by-20 
SOM size was chosen. 

The convergence of the nonvectorial SOM algorithm is safer and faster, if the 
initial models are already two-dimensionally ordered, roughly at least, although 
not yet optimized. In a couple of earlier works [6,7], protein sequences were 
ordered according to the similarity of their dipeptide histograms. We found this 
method useful for the definition of a rough initial order to the SOM. Then, 
however, extra auxiliary model vectors have to be introduced and associated 
with the nodes. The initial ordering of the vectorial models in this auxiliary 
SOM proceeded in the traditional way. Each map node was provided with a 
400-dimensional model vector, each component of which was initialized with a 
random value between zero and unity, whereafter the vectors were normalized 
to unit length. Training was carried out with the 400-dimensional dipeptide histograms 
using 30 batch cycles. A Gaussian neighborhood kernel, the standard deviation 
of which decreased linearly from 30 to 1 during training, was used. 

Next the nodes were labeled by those protein sequences that represented 
the medians in the sublists under the respective nodes (cf. Fig. 1). When this 
labeling was ready, the vectorial parts of the models could be abandoned, and 
the ordering could be continued by the method described in Sec. 2. 

After this initializing phase, the true protein sequences were used as inputs 
as described in Sec. 2 and the winner nodes were determined by the FASTA 
method. The source code for the FASTA computation was extracted from the 
FASTA program package, version 3.0 [18]. The parameter ktup was set to 2, 
the amino acid substitution scores were taken from the BLOSUM50 matrix, and 
the final optimized score for the sequence similarity was computed by dynamic 
programming. 

The SOM was trained for twenty batch cycles, using the neighborhood radius 
of one. (Since the SOM was already ordered, there was no need to use a shrinking 
kernel any longer.) Since the sequence similarities instead of their distances were 
finally computed, for the “median” we had to take that sequence in the union of 
the neighboring sublists that had the largest sum of similarity values with respect 
to all the other sequences in the neighboring lists. The Gaussian neighborhood 
function was applied for the weighting of the similarities. 
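
When similarities are used instead of distances, the “median” becomes the item with the largest weighted similarity sum. The sketch below illustrates this under assumptions that are not spelled out in the text: each candidate carries the grid distance of its node from the node being updated, and each similarity term is weighted by a Gaussian of that distance. The names `gaussian_weight` and `similarity_median` are hypothetical.

```python
import math


def gaussian_weight(grid_distance, sigma):
    """Gaussian neighborhood weight for a node at the given grid distance."""
    return math.exp(-(grid_distance ** 2) / (2.0 * sigma ** 2))


def similarity_median(candidates, similarity, sigma=1.0):
    """candidates: list of (item, grid_distance_to_updated_node) pairs.

    Returns the item with the largest weighted sum of similarities to the
    other candidates.
    """
    def score(index):
        item = candidates[index][0]
        return sum(gaussian_weight(dist, sigma) * similarity(item, other)
                   for k, (other, dist) in enumerate(candidates) if k != index)

    best = max(range(len(candidates)), key=score)
    return candidates[best][0]
```

Switching from `min` over distance sums to `max` over similarity sums is the only structural change compared with the distance-based generalized median.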

It would have presented a very high computing load to the algorithm if all 
the 77,977 protein sequences had been used as inputs at each batch computation 
cycle. The computing load could be reduced to less than ten percent, without 
essentially deteriorating the (statistical) accuracy of the batch computation, by 
randomly picking 6,000 sample sequences from the 77,977 for each batch 
cycle. After 20 such sampled training cycles, one final training cycle was carried 
out using all the available sequences as the inputs. 

The resulting SOM is shown in Fig. 2. The map nodes have been labeled 
according to the identifiers of the final prototypes that resulted from the “median 
map” method. 

For comparison, another labeling was carried out by listing all data sequences 
under the best-matching nodes and then performing the majority voting for each 
list according to the PROSITE classes, release 15 [10] of the sequences. This 
result is shown in Fig. 3. Since the PROSITE database did not give any class 
for 37,743 sequences of the SWISS-PROT database, the PROSITE label of the 
node does not necessarily characterize all sequences of the node. 

The clusters can be characterized by means of the known protein families. 
Those classes whose members are strongly similar are mapped to small areas on 
the map, while other classes may be spread more widely. Actins and rubisco- 
large are examples of the classes which form sharp areas on the map. Globin 
is a large family which is composed of subfamilies. The globin sequences are 
mostly mapped on the top-left corner of the SOM. Hemoglobin beta chains 
are represented on the corner, hemoglobin alpha chains are in the cluster below 
catalases, and myoglobins are located below hemoglobin alpha chains. One sharp 
cluster on the top of the map consists of efactor-gtp sequences. Between globins 
and efactor-gtp there is a cluster of the hsp70 family. Tubulins are mapped to 
two closely located areas, one of which is characterized by alpha subunits and 
the other by beta subunits. 

Fig. 2. A 30-by-20-unit hexagonal SOM grid. The SOM was constructed using all 
the 77,977 protein sequences of the SWISS-PROT release 37. Each node contains a 
prototype sequence and a list of data sequences. The labels on the map nodes are the 
SWISS-PROT identifiers [3] of the prototype sequences. The upper label in each map 
node is the mnemonic of the protein name and the lower label is the mnemonic of the 
species name. The similarities of the neighboring prototype sequences on the map are 
indicated by shades of gray. The light shades indicate a high degree of similarity, and 
the dark shades a low degree of similarity. Light areas on the map reveal 
large clusters of similar sequences. 

Fig. 3. Clustering of 77,977 protein sequences using a 30-by-20 unit SOM. The 
prototype sequences of the map nodes are the same as in Fig. 2. Each node is labeled by 
majority voting of the sequences having that node as their best-matching unit. The 
labels are the PROSITE classes [10] of the sequences. 

Since there are altogether 1,352 classes in the PROSITE database, not all 
of them can be discussed in detail. But a general idea of the capability of the 
SOM can be gained by investigating the projections of the most frequent classes. 
Therefore the PROSITE classes were sorted according to their frequency in the 
SWISS-PROT database. The 32 most frequent classes were then projected on 
the SOM by finding the best-matching unit of each sequence belonging to the 
given class. The resulting class distributions are shown in Fig. 4. 

In the visualization of the class distributions, some PROSITE classes were 
combined. For example, the actins class consists of 249 sequences of the family 
actins_act_like, 232 sequences of actins_2, and 227 sequences of actins_1. 
Trypsin_ser and trypsin_his were combined into the single trypsin class. 
Thiol_protease_asn, thiol_protease_his, and thiol_protease_cys were combined into the 
single thiol_protease class. The cytochrome_b class in the figure consists of both 
cytochrome_b_qo and cytochrome_b_heme. The distribution of protein_kinase_atp 
(1040 sequences) is not shown, because it was identical with the distribution of 
the protein_kinase_dom (1093 sequences). 

Analyzing the cluster contents according to known protein families can give 
information about the specificity of the prototype sequences, as in the 
organization of the database performed on the basis of the sequence similarities. The 
classification of the sequences according to the PROSITE classes, however, may 
also include structural information about the protein molecules. At any rate, 
many PROSITE classes were mapped to small and sharp areas on the SOM 
display. 

Once the SOM has been trained, it is very fast to compute the projection 
of any new sequence. This requires only as many sequence comparisons as there 
are prototype sequences on the map. In the current work, the SOM contained 
600 prototype sequences. Thus the work needed for classifying the new sequence 
into a prototype class is considerably lighter than comparison with all the 77,977 
sequences of the whole database. 
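
Projecting a new sequence onto a trained map amounts to a single pass over the prototypes. The sketch below is illustrative only: `best_matching_unit` is a hypothetical name, and `similarity` stands for whatever measure was used in training (FASTA scores in this work).

```python
def best_matching_unit(sequence, prototypes, similarity):
    """Index of the prototype most similar to `sequence`.

    Needs len(prototypes) comparisons, independent of the database size.
    """
    return max(range(len(prototypes)),
               key=lambda i: similarity(sequence, prototypes[i]))


def toy_similarity(a, b):
    """Toy stand-in for a real similarity score: number of matching positions."""
    return sum(p == q for p, q in zip(a, b))
```

With 600 prototypes in place of 77,977 database sequences, the number of comparisons per query drops by a factor of roughly 130.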



4 Discussion 

This paper is based on the combination of two new possibilities: access to 
masses of biological data on the Internet, and recent development of a clustering 
and visualization method that can cope with the masses of raw nonvectorial data 
in an unsupervised way. 

The currently existing search engines for biological databases may give thou- 
sands of matches as a result of a short DNA sequence as a query sequence. The 
SOM can serve as a global visualization display, onto which also the results ob- 
tained by other means can be mapped. The sequence similarities can then be 
investigated based on the projections of the sequences on the map. The results 
for one query sequence can be all mutually similar or they can form distinct 
clusters, which can be reached by visual browsing. 

The special Self-Organizing Map for symbolic items has been applied in this 
work for the first time to a major problem, self-organization of the 77,977 protein 
sequences of the SWISS-PROT database. Contrasted with earlier works, this 
extension of the SOM allows the use of any similarity measure for sequences. The 
resulting clustering and ordering of the data reflects the properties of the chosen 
similarity measure. The present result, where the similarities are computed by 
the FASTA method, is a two-dimensional map where similar proteins are mapped 
to the same node or neighboring nodes, and the structures of the clusters are 
thereby visualized, too. The geometrically organized picture makes it possible 
to illustrate the relationships of a large number of sequences at a glance. 

Since the SOM provides an ordered display of the representative prototype 
items of the data set, it may be used, e.g., for designing oligonucleotide or cDNA 
arrays (see [5] for a collection of reviews on microarray analysis). If the arrays 
were ordered using the SOM, similar oligonucleotides would be located close to 
each other in the array thus helping the visual interpretation of the data. 

A great advantage of the SOM is that the basic form of the algorithm is very 
simple and straightforward to implement. It is therefore easy to apply the SOM 
to various tasks. The SOM can be used as a data mining and visualization tool 
for any data set, for which a similarity or distance measure between its elements 
can be defined. 

Fig. 4. Projections of the 32 most frequent PROSITE classes of the SWISS-PROT 
database on the SOM. Each subfigure represents the distribution of one class. The 
prototype sequences of the map nodes are the same as in Fig. 2. The shades of gray 
indicate the number of the protein sequences belonging to the given class in each map 
node. The maximum value (darkest shade of gray) is scaled to unity in each subfigure. 
The total number of the sequences in each class is shown in the parentheses after the 
PROSITE name. 

Abstract. Inferring functional relations from relational databases is important 
for discovery of scientific knowledge because many experimental 
data in science are represented in the form of tables and many rules 
are represented in the form of functions. A simple greedy algorithm has 
been known as an approximation algorithm for this problem. In this al- 
gorithm, the original problem is reduced to the set cover problem and 
a well-known greedy algorithm for the set cover is applied. This paper 
shows an efficient implementation of this algorithm that is specialized 
for inference of functional relations. If one functional relation for one 
output variable is required, each iteration step of the greedy algorithm 
can be executed in linear time. If functional relations for multiple out- 
put variables are required, it uses fast matrix multiplication in order to 
obtain non-trivial time complexity bound. In the former case, the algo- 
rithm is very simple and thus practical. This paper also shows that the 
algorithm can find an exact solution for simple functions if input data 
for each function are generated uniformly at random and the size of the 
domain is bounded by a constant. Results of preliminary computational 
experiments on the algorithm are described too. 



1 Introduction 

Many scientific rules are represented in the form of functions. For example, an 
output value $y_j$ may be a function of several input variables $x_{i_1}, \ldots, x_{i_d}$ (i.e., 
$y_j = f_j(x_{i_1}, \ldots, x_{i_d})$). For another example, a simple differential equation of the 
form $\frac{dy_j}{dt} = f_j(x_{i_1}, \ldots, x_{i_d})$ can also be considered as a function if we can know 
the values of $\frac{dy_j}{dt}$ (e.g., using $\frac{\Delta y_j}{\Delta t}$ in place of $\frac{dy_j}{dt}$). Moreover, many 
experimental data in the sciences are represented in the form of tables. Therefore, inferring 
functional relations from tables is important for scientific discovery. Since a 
relational database consists of tables, this problem is almost equivalent to inference 
of functional relations from relational databases. 

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 86-98, 2000. 

© Springer-Verlag Berlin Heidelberg 2000 

Inference of functional relations (or almost equivalently, inference of 
functional dependencies) from relational databases is a rather classical problem 
in the field of KDD (knowledge discovery in databases) [2,9-11]. Since 
$y_j = f'_j(x_{i_1}, \ldots, x_{i_d}, x_{i_{d+1}}, \ldots, x_{i_{d'}})$ holds for all $x_{i_{d+1}}, \ldots, x_{i_{d'}}$ 
if $y_j = f_j(x_{i_1}, \ldots, x_{i_d})$ 
holds, it is usually required to find the minimum set or the minimal sets of 
input variables. Mannila and Raiha proposed a heuristic algorithm for finding 
functional dependencies [10]. Inference of functional dependencies with small 
noise was also studied and a PAC-type analysis was made [2,9]. Unfortunately, 
Mannila and Raiha proved that finding a functional dependency with the 
minimum number of input attributes (i.e., $d$ is minimum) is NP-hard [11]. Therefore, 
development of heuristic algorithms and/or approximation algorithms is 
important. As mentioned before, Mannila and Raiha proposed a heuristic algorithm 
[10]. Akutsu and Bao proposed a simple greedy algorithm in which the original 
problem was reduced to the set cover problem [3]. Although an upper bound 
on the approximation ratio (on $d$) is given, the time complexity is not low if 
the algorithm is implemented as it is. Even for finding a functional relation for one output 
variable (i.e., one $y_j$), it takes $O(m^2 n g)$ time ($O(m^2 n)$ time using an efficient 
implementation for the set cover problem [7]), where $n$ denotes the number of 
attributes, $m$ denotes the number of tuples and $g$ denotes the number of main 
iterations in the greedy algorithm. This time complexity is too high for applying 
the greedy algorithm to large databases. 

This paper gives a simple implementation of the greedy algorithm, which 
runs in $O(mng)$ time in the case of finding a functional relation for one 
output variable. Each iteration can be done in linear time since the size of the input 
data (i.e., the input table) is $O(mn)$. This complexity is reasonable because $g$ is 
usually small (e.g., $< 10$). This algorithm has some similarity with decision tree 
construction algorithms; the similarity and difference are discussed 
in the final section. In some applications, it is required to infer 
functional relations for multiple output variables simultaneously. For example, 
in inference of genetic networks [4], functional relations should be inferred for 
all genes (i.e., for $n$ genes). In such a case, $O(mn^2 g)$ time is still required 
using the efficient implementation mentioned above. Therefore, we developed an 
improved algorithm for a special case in which $g$ can be regarded as a constant 
and the size of the domain is bounded by a constant. This algorithm is based on 
a fast matrix multiplication algorithm [6] as in [4], and the time complexity is 
$O(m^{\omega-2} n^2 + m n^{2+\omega-3})$, where $\omega$ is the exponent of matrix multiplication 
(currently, $\omega < 2.376$ [6]). Although this algorithm is not practical, it is faster than 
the $O(mn^2 g)$ time algorithm when $m$ is large. 

This paper also gives an average case analysis of the greedy algorithm for 
simple functions (such as AND of literals, OR of literals), under the condition 
that input data are generated uniformly at random and the size of the domain is 
bounded by a constant. In this case, the greedy algorithm finds an exact solution 
with high probability, where the probability is taken over all possible input 
data. This gives another theoretical guarantee for the algorithm. Recall that it is 
already known that the greedy algorithm outputs a solution with a guaranteed 
approximation ratio even in the worst case [3]. Therefore, the greedy algorithm 
works very well for average-case inputs, and it does 
not work so badly even in the worst case. 

In order to confirm the effectiveness of the algorithm, we made preliminary 
computational experiments. Since we were particularly interested in the 
application to inference of genetic networks, we made computational experiments on 
Boolean networks, where the Boolean network is a mathematical model of a 
genetic network [13]. The results of computational experiments suggest that the 
greedy algorithm is very useful. 

Before describing details, we briefly discuss the difference between 
association rules [1] and functional relations, since inference of association rules 
is well studied. In order to explain the difference, we consider very simple rules 
on the binary domain $\{0, 1\}$. The following is a typical example of an association 
rule: $(x_{i_1} = 1) \wedge (x_{i_2} = 0) \wedge (x_{i_3} = 1) \rightarrow y_j = 1$, which means that if $x_{i_1} = 1$, 
$x_{i_2} = 0$ and $x_{i_3} = 1$ hold then $y_j$ should be 1, where we are considering a 
simplified version of an association rule and thus we do not consider support 
and confidence. In this case, $y_j$ can take any value if any one of $x_{i_1} \neq 1$, 
$x_{i_2} \neq 0$, $x_{i_3} \neq 1$ holds. The following is a typical example of a functional relation: 
$y_j = x_{i_1} \wedge \neg x_{i_2} \wedge x_{i_3}$, which means that $y_j$ is 1 if and only if $x_{i_1} = 1$, 
$x_{i_2} = 0$ and $x_{i_3} = 1$ hold. Therefore, $y_j$ should be 0 if any one of $x_{i_1} \neq 1$, 
$x_{i_2} \neq 0$, $x_{i_3} \neq 1$ holds. Although association rules are convenient for representing 
various kinds of knowledge, functional relations seem to be more appropriate 
for representing concrete rules (such as differential equations). 
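
The difference between the two kinds of rules can be checked on a small table. In this illustrative sketch each row is a tuple `(x1, x2, x3, y)`, where `x1`, `x2`, `x3` stand for the three input variables of the example and `y` for the output; the function names and the table are made up for the illustration.

```python
def rule_holds(table):
    """Association rule: (x1 = 1) and (x2 = 0) and (x3 = 1) -> y = 1."""
    return all(y == 1 for x1, x2, x3, y in table
               if (x1, x2, x3) == (1, 0, 1))


def relation_holds(table):
    """Functional relation: y = x1 AND (NOT x2) AND x3."""
    return all(y == (x1 & (1 - x2) & x3) for x1, x2, x3, y in table)


table = [
    (1, 0, 1, 1),  # antecedent true, y = 1: consistent with both
    (0, 0, 1, 1),  # antecedent false, y = 1: allowed by the rule only
]
```

Here `rule_holds(table)` is true while `relation_holds(table)` is false, because the second row has output 1 although the conjunction evaluates to 0.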

2 Preliminaries 

In this paper, we assume a fixed and finite domain $\mathcal{D}$ for all attributes; 
extension to cases in which different domains are assigned to different attributes 
is straightforward and thus omitted. Extension to the domain of real numbers 
will be discussed in Section 3. 

For simplicity, we consider two sets of attributes: the set of input attributes 
and the set of output attributes, where these two sets are not necessarily disjoint. 
Usual functional dependencies can be treated by letting both sets be identical 
to the original set of attributes. Input attributes and output attributes are also 
called input variables and output variables, respectively. Let $x_1, \ldots, x_n$ be input 
variables. Let $y_1, \ldots, y_l$ be output variables. Let 
$\langle x_1(k), \ldots, x_n(k), y_1(k), \ldots, y_l(k) \rangle$ 
be the $k$-th tuple in the table, where $x_i(k) \in \mathcal{D}$, $y_j(k) \in \mathcal{D}$ for all $i, j, k$. 
Then, we define the problem of inference of functional relations in the following 
way. 

INPUT: $\langle x_1(k), \ldots, x_n(k), y_1(k), \ldots, y_l(k) \rangle$ $(k = 1, \ldots, m)$, where $x_i(k) \in \mathcal{D}$ 
and $y_j(k) \in \mathcal{D}$ for all $i, j, k$. 

OUTPUT: for each $y_j$, a set $X_j = \{x_{i_1}, \ldots, x_{i_d}\}$ with the minimum cardinality 
(i.e., minimum $d$) for which there exists a function $f_j(x_{i_1}, \ldots, x_{i_d})$ such that 
$(\forall k)(y_j(k) = f_j(x_{i_1}(k), \ldots, x_{i_d}(k)))$ holds. 

It should be noted that the problem is defined as a minimization problem 
since the number of sets of input variables satisfying the condition can be 
exponential. If multiple sets with the same cardinality satisfy the condition, any of them 
can be output. It should also be noted that we do not require the explicit 
representation of the function $f_j$ because the number of possible $f_j$’s with $d$ input 
variables on domain $\mathcal{D}$ is $|\mathcal{D}|^{|\mathcal{D}|^d}$ and thus exponential space (on the order of 
$|\mathcal{D}|^d \log |\mathcal{D}|$) is required in order to represent $f_j$. Thus, we are not concerned 
with the representation of $f_j$; we are concerned only with the sets of input variables. 

Since each $y_j$ can be treated independently, we assume $l = 1$ unless otherwise 
stated. 



3 Simple Greedy Algorithm 

It is known that inference of functional dependencies is NP-hard [11]. Therefore, 
a simple greedy algorithm has been proposed [3]. We denote this algorithm by 
GREEDY. In GREEDY, the original problem is reduced to the set cover problem 
and a well-known greedy algorithm for the set cover [8] is applied. The following 
is a pseudo-code for GREEDY. 

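    $S \leftarrow \{(k_1, k_2) \mid k_1 < k_2 \text{ and } y_1(k_1) \ne y_1(k_2)\}$
    $X_1 \leftarrow \{\}$
    $X \leftarrow \{x_1, \ldots, x_n\}$
    while $S \ne \{\}$ do
        for all $x_i \in X$ do
            $S_i \leftarrow \{(k_1, k_2) \in S \mid x_i(k_1) \ne x_i(k_2)\}$
        let $x_i$ be the variable with the maximum $|S_i|$
        $S \leftarrow S - S_i$
        $X \leftarrow X - \{x_i\}$
        $X_1 \leftarrow X_1 \cup \{x_i\}$
    output $X_1$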
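The pseudo-code above can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation: the function name `greedy`, the column-wise data layout (`xs[i][k]` for $x_i(k)$), the tie-breaking rule (lowest index wins) and the error raised when no variable covers a remaining pair are all assumptions made for the example.

```python
from itertools import combinations, product

def greedy(xs, y):
    """xs[i][k] is the value of input variable i in tuple k; y[k] is the output.
    Returns the indices of the selected variables, in selection order."""
    m = len(y)
    # S: pairs of tuples with different outputs; each must be covered by some variable
    S = {(k1, k2) for k1, k2 in combinations(range(m), 2) if y[k1] != y[k2]}
    remaining = set(range(len(xs)))
    selected = []
    while S:
        best, best_cover = None, set()
        for i in sorted(remaining):
            # S_i: pairs in S on which variable i takes different values
            cover = {(k1, k2) for (k1, k2) in S if xs[i][k1] != xs[i][k2]}
            if len(cover) > len(best_cover):
                best, best_cover = i, cover
        if best is None:
            # assumption: identical inputs with different outputs is an error
            raise ValueError("no functional relation exists for this table")
        S -= best_cover
        remaining.discard(best)
        selected.append(best)
    return selected

# Demo: y = x0 AND x2 over all assignments of three Boolean variables.
rows = list(product([0, 1], repeat=3))
xs = [[r[i] for r in rows] for i in range(3)]
y = [r[0] & r[2] for r in rows]
selected = greedy(xs, y)
print(selected)
```

Building $S$ explicitly costs $O(m^2)$ space and time, which is exactly the cost that the block-based implementation of the next section avoids.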

Using the result on the approximation ratio of the greedy algorithm for the 
set cover [8], the following result was obtained. 

Theorem 1. [3] Suppose that $f_1(x_{i_1}(k), \ldots, x_{i_d}(k)) = y_1(k)$ holds for all $k$.
Then GREEDY outputs a set of variables $\{x_{i'_1}, \ldots, x_{i'_g}\}$ such that
$g \le (2 \ln m + 1)d$ and there exists a function $f'_1$ satisfying
$f'_1(x_{i'_1}(k), \ldots, x_{i'_g}(k)) = y_1(k)$ for all $k$.

Note that an $\Omega(\log m)$ lower bound on the approximation ratio was also proven
in [3]. Thus, the approximation ratio of GREEDY is optimal up to a constant
factor.

GREEDY may be modified for the domain of real numbers by replacing
$y_1(k_1) \ne y_1(k_2)$ and $x_i(k_1) \ne x_i(k_2)$ with $|y_1(k_1) - y_1(k_2)| > \delta$ and
$|x_i(k_1) - x_i(k_2)| > \delta'$ respectively, using appropriate $\delta$ and $\delta'$.
4 Efficient Implementation

GREEDY takes $O(m^2 n g)$ time (for $l = 1$) if it is executed as it is, where $g$ is the
number of iterations of the while loop (i.e., the size of $X_1$). GREEDY takes
$O(m^2 n)$ time even if an efficient implementation [7] for the set cover problem is
used. For large $n$ and $m$, this would take too long. Therefore, we should consider
a more efficient implementation. In this section, we describe an improved
implementation of GREEDY (we call it GREEDY1) which works in $O(mng)$
time. Since the size of the input data is $O(mn)$, this means that each iteration can
be done in linear time.

We assume without loss of generality that $\mathcal{D} = \{0, 1, 2, \ldots, D - 1\}$, where
$D$ is a constant (recall that we assumed a finite domain). We partition the set
of tuples into blocks $B_1, \ldots, B_H$, where each tuple $(x_1(k), \ldots, y_1(k))$ is denoted
by $k$. The method of partition will be described later. For each block $B_h$, for
each input variable $x_i$, and for each $(p, q) \in \mathcal{D} \times \mathcal{D}$, we define $c_{h,i}(p, q)$ by
$c_{h,i}(p, q) = |\{k \in B_h \mid x_i(k) = p \text{ and } y_1(k) = q\}|$. Then, we define $c_{h,i}$ by

$$c_{h,i} = \sum_{q < q'} \Bigl( \sum_{p \ne p'} c_{h,i}(p, q) \cdot c_{h,i}(p', q') \Bigr).$$

We say that $(k, k')$ is covered by $x_i$ if both $y_1(k) \ne y_1(k')$ and $x_i(k) \ne x_i(k')$
hold. Then, $c_{h,i}$ denotes the number of pairs $(k, k')$ from $B_h$ that are covered by
$x_i$ (i.e., $|S_i| = \sum_h c_{h,i}$).
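As a small sanity check of the counting formula for $c_{h,i}$, the following sketch computes the number of covered pairs in one block from the counts $c_{h,i}(p,q)$ and compares it with direct enumeration of pairs. The helper names `covered_pairs_in_block` and `brute_force` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

def covered_pairs_in_block(block, x, y):
    """Number of pairs (k, k') in `block` with y[k] != y[k'] and x[k] != x[k'],
    obtained from the counts c(p, q) without enumerating pairs."""
    c = Counter((x[k], y[k]) for k in block)
    total = 0
    for (p, q), cnt in c.items():
        for (p2, q2), cnt2 in c.items():
            # q < q2 counts each unordered pair once; p != p2 means x covers it
            if q < q2 and p != p2:
                total += cnt * cnt2
    return total

def brute_force(block, x, y):
    """Same quantity by direct enumeration of all pairs in the block."""
    return sum(1 for k1, k2 in combinations(block, 2)
               if y[k1] != y[k2] and x[k1] != x[k2])

# Toy block of six tuples with a three-valued input and a Boolean output.
x = [0, 0, 1, 1, 2, 2]
y = [0, 1, 0, 1, 1, 0]
block = list(range(6))
by_counts = covered_pairs_in_block(block, x, y)
by_pairs = brute_force(block, x, y)
print(by_counts, by_pairs)
```

With a constant domain size the count-based version touches each tuple once plus a constant number of $(p,q)$ combinations, whereas the enumeration is quadratic in the block size.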

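Then, the following pseudo-code describes the improved algorithm:

    $B_1 \leftarrow \{1, \ldots, m\}$
    $H \leftarrow 1$
    $X_1 \leftarrow \{\}$
    $X \leftarrow \{x_1, \ldots, x_n\}$
    while $|\{y_1(k) \mid k \in B_h\}| > 1$ for some $h \le H$ do
        for all $x_i \in X$ do
            $c_i \leftarrow \sum_h c_{h,i}$
        let $x_i$ be the variable with the maximum $c_i$
        for all $B_h$ and $p$ ($p = 0, \ldots, D - 1$) do
            $B_{h,p} \leftarrow \{k \in B_h \mid x_i(k) = p\}$
        remove $B_{h,p}$ such that $B_{h,p} = \{\}$
        replace $B_1, \ldots, B_H$ by the remaining $B_{h,p}$'s (updating $H$ accordingly)
        $X \leftarrow X - \{x_i\}$
        $X_1 \leftarrow X_1 \cup \{x_i\}$
    output $X_1$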
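A compact Python sketch of the block-based procedure above is given below. It is an illustration under assumptions rather than the authors' code: the names `block_cover` and `greedy1`, the column-wise layout `xs[i][k]`, the tie-breaking rule (lowest index wins) and the raised error are all choices made for this example.

```python
from collections import Counter, defaultdict
from itertools import product

def block_cover(block, x, y):
    """c_{h,i}: pairs in one block that differ in both x and y, from counts c(p, q)."""
    c = Counter((x[k], y[k]) for k in block)
    return sum(n1 * n2
               for (p, q), n1 in c.items()
               for (p2, q2), n2 in c.items()
               if q < q2 and p != p2)

def greedy1(xs, y):
    """Block-partitioning variant: returns selected variable indices in order."""
    blocks = [list(range(len(y)))]
    remaining = list(range(len(xs)))
    selected = []
    # continue while some block still contains two different output values
    while any(len({y[k] for k in b}) > 1 for b in blocks):
        best, best_c = None, 0
        for i in remaining:
            c_i = sum(block_cover(b, xs[i], y) for b in blocks)
            if c_i > best_c:
                best, best_c = i, c_i
        if best is None:
            # assumption: identical inputs with different outputs is an error
            raise ValueError("no functional relation exists for this table")
        # split every block according to the value of the selected variable
        new_blocks = []
        for b in blocks:
            parts = defaultdict(list)
            for k in b:
                parts[xs[best][k]].append(k)
            new_blocks.extend(parts.values())
        blocks = new_blocks
        remaining.remove(best)
        selected.append(best)
    return selected

# Demo: y = x0 AND x2 over all assignments of three Boolean variables.
rows = list(product([0, 1], repeat=3))
xs = [[r[i] for r in rows] for i in range(3)]
y = [r[0] & r[2] for r in rows]
selected1 = greedy1(xs, y)
print(selected1)
```

Only non-empty blocks are ever created, so the explicit "remove empty $B_{h,p}$" step of the pseudo-code is implicit here.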

Note that the condition $|\{y_1(k) \mid k \in B_h\}| > 1$ means that there exists a pair of
tuples from $B_h$ which should be covered by using additional variables not in $X_1$.
From the definitions of $B_h$ and $c_{h,i}$, it is easy to see that GREEDY1 is equivalent
to GREEDY.

Fig. 1. Example of execution of GREEDY1. In the first iteration, $x_3$ is selected because
$c_1 = 3$, $c_2 = 3$ and $c_3 = 4$. In the second iteration, either $x_1$ or $x_2$ is selected because
$c_1 = 1$ and $c_2 = 1$.



Next, we consider the time complexity. Since the $B_h$'s form a partition of
$\{1, 2, \ldots, m\}$, the $B_{h,p}$'s can be computed in $O(m)$ time (per while loop). So, the most
time-consuming part is the computation of the $c_{h,i}(p, q)$'s. In order to compute the
$c_{h,i}(p, q)$'s, we use the following procedure:

    for all $(p, q) \in \mathcal{D} \times \mathcal{D}$ do
        $c_{h,i}(p, q) \leftarrow 0$
    for $h = 1$ to $H$ do
        for all $k \in B_h$ do
            $p \leftarrow x_i(k)$
            $q \leftarrow y_1(k)$
            $c_{h,i}(p, q) \leftarrow c_{h,i}(p, q) + 1$

Since $D$ is assumed to be a constant and $\sum_h |B_h| \le m$, this procedure, run for
every $x_i$, works in $O(mn)$ time. Therefore, the total time complexity is $O(mng)$. We can
use the same space to store the $c_{h,i}(p, q)$'s for different $x_i$'s. Thus, the space
complexity is linear (i.e., $O(mn)$).

Theorem 2. Suppose that $f_1(x_{i_1}(k), \ldots, x_{i_d}(k)) = y_1(k)$ holds for all $k$. Then
GREEDY1 outputs a set of variables $\{x_{i'_1}, \ldots, x_{i'_g}\}$ in $O(mng)$ time such that
$g \le (2 \ln m + 1)d$ and there exists a function $f'_1$ satisfying
$f'_1(x_{i'_1}(k), \ldots, x_{i'_g}(k)) = y_1(k)$ for all $k$.

Even if $D$ is not a constant, the total time complexity is $O(mng \log n)$ by
using an appropriate data structure for maintaining $c_{h,i}(p, q)$.

Until now, we assumed $l = 1$. But, in some cases, we should find functional
relations for many $y_j$'s. For example, in inference of genetic networks [4], we
should find functional relations for all genes (i.e., $l = n$), where the number
of genes is at least several thousands even for micro-organisms such as Yeast.
It would take $O(mn^2 g)$ time if we apply GREEDY1 to all $y_j$ independently.
Therefore, it is worthwhile to consider an efficient implementation for the case of
$l = n$.
For small $g$ and small $D$ (for example, $g, D < 3$), we can develop an improved
algorithm using fast matrix multiplication as in [4]. Here, we show a brief sketch
of the algorithm. Recall that the most time-consuming part of GREEDY1 is the
computation of the $c_{h,i}(p, q)$'s. Recall that $c_{h,i}(p, q)$ is defined as follows.

$$c_{h,i}(p, q) = |\{k \in B_h \mid x_i(k) = p \text{ and } y_j(k) = q\}|.$$

Since $c_{h,i}(p, q)$ must be computed for each $y_j$, we use $c_{h,i,j}(p, q)$ to denote the
value of $c_{h,i}(p, q)$ for $y_j$.

Here, we define an $m$-dimensional vector $\mathbf{x}_i^p$ by

$$(\mathbf{x}_i^p)_k = \begin{cases} 1, & \text{if } x_i(k) = p, \\ 0, & \text{otherwise,} \end{cases}$$

where $(\mathbf{x}_i^p)_k$ denotes the $k$-th element of a vector $\mathbf{x}_i^p$. We define $\mathbf{y}_j^q$ in a similar
way. Then, $c_{h,i,j}(p, q)$ is equal to the inner product $\mathbf{x}_i^p \cdot \mathbf{y}_j^q$. Therefore, for each
fixed $h, p, q$, we can compute the $c_{h,i,j}(p, q)$'s by using a matrix multiplication as in
[4]. Of course, this computation should be done for all combinations of $h, p, q$.
But, if $g$ and $D$ are constant, the number of combinations is a constant. In
such a case, the total time complexity is $O(m^{\omega-2} n^2 + m n^{\omega-1})$, where $\omega$ is the
exponent of matrix multiplication (currently, $\omega < 2.376$ [6]). It is smaller than
$O(mn^2)$ if $m$ is large. We denote this algorithm by GREEDY2.
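To illustrate the inner-product formulation, the sketch below builds the 0/1 indicator vectors and computes the counts for all pairs $(x_i, y_j)$ as one matrix product, restricted to the single-block case (the first iteration, where $B_1$ contains every tuple). The function names and the toy data are hypothetical, and the product is written naively in plain Python; a fast matrix multiplication routine would take its place in an actual implementation.

```python
def indicator(column, value):
    """0/1 vector whose k-th entry is 1 exactly when column[k] == value."""
    return [1 if v == value else 0 for v in column]

def count_matrix(xs, ys, p, q):
    """C[i][j] = |{k : xs[i][k] == p and ys[j][k] == q}|, computed as the
    product of the indicator matrix of the x's with the transposed
    indicator matrix of the y's (single block containing all tuples)."""
    X = [indicator(col, p) for col in xs]   # n rows of length m
    Y = [indicator(col, q) for col in ys]   # l rows of length m
    return [[sum(a * b for a, b in zip(x_row, y_row)) for y_row in Y]
            for x_row in X]

# Toy data: two input variables, one output variable, four tuples.
xs = [[0, 1, 1, 0],
      [1, 1, 0, 0]]
ys = [[1, 1, 0, 0]]
counts_11 = count_matrix(xs, ys, p=1, q=1)
print(counts_11)
```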

Theorem 3. Suppose that $f_j(x_{i_1}(k), \ldots, x_{i_d}(k)) = y_j(k)$ holds for all $k$ and for
all $y_j$. Suppose also that $g$ (the number of iterations) and $D$ are bounded by a
constant. Then GREEDY2 outputs a set of variables $\{x_{i'_1}, \ldots, x_{i'_g}\}$ for all $y_j$
in $O(m^{\omega-2} n^2 + m n^{\omega-1})$ time such that there exists a function $f'_j$ satisfying
$f'_j(x_{i'_1}(k), \ldots, x_{i'_g}(k)) = y_j(k)$ for all $k$.



5 Average Case Analysis for Simple Functions 

Even if we only consider the Boolean domain $\mathcal{D} = \{0, 1\}$ and we restrict functions
to be either AND of literals or OR of literals, the inference problem remains
NP-hard [4]. But, on the average, GREEDY (or equivalently GREEDY1) finds
correct sets of input variables with high probability in that case, where the
average is taken over all possible inputs and we assume that input data are
generated uniformly at random. We also assume that the output value $y_1$ depends
only on input variables $x_{i_1}, \ldots, x_{i_d}$. In this section, we show a sketch of the
proof for this case and then we discuss the extension to other functions and
other domains.



5.1 Analysis of AND/OR Functions 

For simplicity, we only consider the following function:

$$y_1(k) = x_{i_1}(k) \wedge x_{i_2}(k) \wedge \cdots \wedge x_{i_d}(k).$$
Fig. 2. Illustration for average case analysis of a Boolean function
$y_1(k) = x_1(k) \wedge x_2(k) \wedge x_3(k)$ (panels: 1st iteration, 2nd iteration). In the first
iteration, $\frac{4}{7}|S|$ pairs are covered. In the second iteration, $\frac{2}{3}|S|$ pairs are covered.



By symmetry, the other AND functions and OR functions can be treated
in an analogous way.

Since we assume that input data are generated uniformly at random,
$\mathrm{Prob}(x_i(k) = 1) = \mathrm{Prob}(x_i(k) = 0) = 0.5$ holds for all $x_i$ and for all $k$. Therefore,
for each assignment $A$ to $x_{i_1}, \ldots, x_{i_d}$, the number $|\{k \mid (x_{i_1}(k), \ldots, x_{i_d}(k)) = A\}|$
is expected to be very close to $\frac{1}{2^d} m$ in most cases if $m$ is sufficiently large, where
we denote an assignment by a vector of 0's and 1's (for example, $(0, 1, 1)$ denotes
$x_{i_1} = 0$, $x_{i_2} = 1$, $x_{i_3} = 1$). Among the $2^d$ possible assignments to $x_{i_1}(k), \ldots, x_{i_d}(k)$,
only $(1, 1, \ldots, 1)$ can make $y_1(k) = 1$. Therefore, at the first line of GREEDY,
$|S| \approx \frac{1}{2^d} m \times \frac{2^d - 1}{2^d} m$ is expected. If $x_i \notin \{x_{i_1}, \ldots, x_{i_d}\}$,
$\mathrm{Prob}(x_i(k) \ne x_i(k')) = \frac{1}{2}$ holds for each pair $(k, k') \in S$ because the $x_i$'s are assumed to be
independent. Therefore, for $x_i \notin \{x_{i_1}, \ldots, x_{i_d}\}$, $|S_i| \approx \frac{1}{2}|S|$ is expected. On
the other hand, if $x_i \in \{x_{i_1}, \ldots, x_{i_d}\}$, $\mathrm{Prob}(x_i(k) \ne x_i(k')) = \frac{2^{d-1}}{2^d - 1}$ holds and
thus $|S_i| \approx \frac{2^{d-1}}{2^d - 1}|S|$ is expected (see Fig. 2). Since $\frac{2^{d-1}}{2^d - 1}|S| > \frac{1}{2}|S|$, it is expected
that one of $x_{i_1}, \ldots, x_{i_d}$ is selected in the first iteration of GREEDY.

Assume that $x_{i_1}$ is selected in the first iteration. Then, at the beginning of
the second iteration, $|S| \approx \frac{1}{2^d} m \times \frac{2^{d-1} - 1}{2^d} m$ is expected. If $x_i \notin \{x_{i_1}, \ldots, x_{i_d}\}$,
$|S_i| \approx \frac{1}{2}|S|$ is expected too. If $x_i \in \{x_{i_2}, \ldots, x_{i_d}\}$, $|S_i| \approx \frac{2^{d-2}}{2^{d-1} - 1}|S|$ is expected.
Therefore, it is expected that one of $x_{i_2}, \ldots, x_{i_d}$ is selected in the second iteration
of GREEDY.
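The expected coverage fractions in the two iterations can be checked exactly on a small idealized table in which every assignment of the input variables appears exactly once (the limiting case of uniformly random data). The sketch below does this for $d = 3$ relevant variables plus one irrelevant variable; the function name `cover_fractions` and the table layout are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def cover_fractions(rows, pairs):
    """For each variable i, the fraction of the given pairs on which x_i differs."""
    n = len(rows[0])
    return [Fraction(sum(rows[a][i] != rows[b][i] for a, b in pairs), len(pairs))
            for i in range(n)]

# Every assignment of n = 4 Boolean variables once; y = x0 AND x1 AND x2 (d = 3).
rows = list(product([0, 1], repeat=4))
y = [r[0] & r[1] & r[2] for r in rows]

# S at the first line of GREEDY: all pairs of tuples with different outputs.
S = [(a, b) for a in range(len(rows)) for b in range(a + 1, len(rows))
     if y[a] != y[b]]
first = cover_fractions(rows, S)

# Pairs that remain uncovered after x0 has been selected.
S2 = [(a, b) for a, b in S if rows[a][0] == rows[b][0]]
second = cover_fractions(rows, S2)

print(first)   # relevant variables: 2^(d-1) / (2^d - 1); irrelevant: 1/2
print(second)  # remaining relevant variables: 2^(d-2) / (2^(d-1) - 1)
```

For $d = 3$ the relevant variables cover $4/7$ of the pairs in the first iteration and $2/3$ in the second, while the irrelevant variable stays at $1/2$ in both.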

In this way, it is expected that $x_{i_1}, \ldots, x_{i_d}$ are selected and output by
GREEDY. By making a detailed probabilistic analysis, we can obtain the following
theorem.

Theorem 4. Suppose that $\mathcal{D} = \{0, 1\}$ and functions are restricted to be either
AND of $d$ literals or OR of $d$ literals, where $d$ is a constant. Suppose that input
data are generated uniformly at random (i.e., $\mathrm{Prob}(x_i(k) = 1) = 0.5$ for all $x_i$
and for all $k$). Then, for sufficiently large $m$ ($m = \Omega(\alpha \log n)$), GREEDY outputs
the correct set of input variables $\{x_{i_1}, \ldots, x_{i_d}\}$ for $y_1$ with high probability (with
probability $> 1 - \frac{1}{n^{\alpha}}$ for any fixed constant $\alpha$).

Note that if we choose the $d$ variables with the $d$ highest $|S_i|$'s in the first
iteration of GREEDY, we can output the correct set of variables $\{x_{i_1}, \ldots, x_{i_d}\}$ with
high probability. However, the chance that a correct $x_{i_2}$ is selected in the second
iteration of GREEDY is higher because $\frac{2^{d-2}}{2^{d-1} - 1} > \frac{2^{d-1}}{2^d - 1}$. Therefore, GREEDY
(equivalently GREEDY1) should be used.



5.2 Towards Analysis of More General Functions 

Unfortunately, GREEDY cannot output the correct set of variables for XOR
functions. For example, consider the case of $y_1(k) = x_{i_1}(k) \oplus x_{i_2}(k)$. Then,
$|S_{i_1}| \approx \frac{1}{2}|S|$ and $|S_{i_2}| \approx \frac{1}{2}|S|$ are expected, which is the same as for irrelevant
variables. Thus, GREEDY may fail to find $x_{i_1}$
or $x_{i_2}$. However, this case seems to be an exceptional case. From computational
experiments on $d < 4$, we found that properties similar to that in Section 5.1
held for many Boolean functions of $d < 4$. Although we have not yet succeeded, we
are trying to clarify the class of functions for which GREEDY outputs correct
sets of input variables in the average case.
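The XOR difficulty can be seen on an idealized table in which each assignment occurs exactly once: every variable, relevant or not, covers exactly half of the pairs in $S$, so the greedy criterion has no signal to work with. The following sketch is only an illustration; the variable names and the three-variable table are assumptions.

```python
from fractions import Fraction
from itertools import product

# Every assignment of three Boolean variables once; y = x0 XOR x1, x2 is irrelevant.
rows = list(product([0, 1], repeat=3))
y = [r[0] ^ r[1] for r in rows]

# S: all pairs of tuples with different outputs.
S = [(a, b) for a in range(len(rows)) for b in range(a + 1, len(rows))
     if y[a] != y[b]]

# Fraction of S covered by each variable.
xor_fractions = [Fraction(sum(rows[a][i] != rows[b][i] for a, b in S), len(S))
                 for i in range(3)]
print(xor_fractions)
```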

Extension of the Boolean domain to another fixed domain $\mathcal{D}$ is possible. In this
case, the AND function may be replaced by a function of the form

    if $x_{i_1}(k) = z_1 \wedge x_{i_2}(k) = z_2 \wedge \cdots \wedge x_{i_d}(k) = z_d$ then $y_1 = z_{d+1}$ else $y_1 = z_{d+2}$,

where $z_i \in \mathcal{D}$. If $|\mathcal{D}|$ is a constant, an analysis similar to that in Section 5.1 is
possible. But, if $|\mathcal{D}|$ is large, it is difficult to discriminate $x_{i_1}, \ldots, x_{i_d}$ from the
other $x_i$'s because the gap between the $|S_i|$'s will become very small.

Although we assumed noiseless cases, real data contain noise. For example,
$y_j(k) = f_j(x_{i_1}(k), \ldots, x_{i_d}(k))$ may not hold with some error probability $\epsilon$
(i.e., $\mathrm{Prob}(y_j(k) \ne f_j(x_{i_1}(k), \ldots, x_{i_d}(k))) = \epsilon$). However, it is expected that
GREEDY still works for this case. Although GREEDY would output variables
not in $\{x_{i_1}, \ldots, x_{i_d}\}$ if all pairs in $S$ had to be covered, we can stop GREEDY
after the $d$-th iteration. In this case, it is expected for sufficiently small $\epsilon$ and
sufficiently large $m$ that GREEDY will output the correct set of variables.

6 Preliminary Computational Experiments 

We made preliminary computational experiments in order to verify the
effectiveness of GREEDY1. As mentioned in Section 1, we are particularly interested
in the application to inference of genetic networks. Therefore, we made
computational experiments using a mathematical model of a genetic network.

For modeling genetic networks, various mathematical models have been
proposed. The Boolean network is the simplest model among them [13]. Although this
is a conceptual model and real genetic networks are much more complex, this
model may still be useful for analysis of real experimental data because the
errors of real experimental data are large and thus real values should be classified
into several discrete values (e.g., 0 and 1 in the Boolean network). Thus, we used
Boolean networks in this computational experiment. We can treat inference of
Boolean networks by letting $\mathcal{D} = \{0, 1\}$ and $l = n$ [4]. However, we made
experiments only on cases of $l = 1$ because the $y_j$'s can be treated independently. For
all computational experiments, we used a PC with a 700MHz AMD Athlon CPU
(1 CPU) on which the Turbo Linux 4.2 operating system was running.

First, we made a computational experiment on AND functions. We examined
all combinations of $n = 1000$, $m = 100, 500, 1000$ and $d = 2, 3, 4, 5$. For each
case, we calculated the average CPU time required for inference and the ratio of
successful executions of GREEDY1 over 100 trials, where we say that GREEDY1
is successful if it outputs the correct set $\{x_{i_1}, \ldots, x_{i_d}\}$. We modified GREEDY1
so that it stopped after the $d$-th iteration. For each case, we generated input
data and an AND function in the following way. First, we choose $d$ different
variables $x_{i_1}, \ldots, x_{i_d}$ randomly from $x_1, \ldots, x_n$, and then we choose an AND
function $f_1(x_{i_1}, \ldots, x_{i_d})$ randomly from the $2^d$ possible AND functions (i.e., ANDs
of literals). For all $x_i$ ($i = 1, \ldots, n$) and for all $k$ ($k = 1, \ldots, m$), we let $x_i(k) = 0$
with probability 0.5 and $x_i(k) = 1$ with probability 0.5. For all $k$, we let
$y_1(k) = f_1(x_{i_1}(k), \ldots, x_{i_d}(k))$. The following table shows the result, where the percentage
of the success ratio (%) and the average CPU time (sec.) are given for each case.





              d = 2           d = 3           d = 4           d = 5
m = 100     100%  0.006      83%  0.014       2%  0.022       1%  0.030
m = 500     100%  0.113     100%  0.165      99%  0.218      39%  0.274
m = 1000    100%  0.280     100%  0.399     100%  0.540      95%  0.659



It is seen from this table that GREEDY1 outputs the correct sets of variables
with high probability for AND functions, and the probability increases as $m$
increases. Note also that the CPU time increases nearly linearly with $d$. Although the
CPU time for $m = 1000$ is longer than twice that for $m = 500$, it is still nearly
linear in $m$.
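The data-generation procedure used in the AND-function experiment above can be sketched as follows. This is a minimal illustration and not the authors' experimental code: the function name `random_and_instance`, its parameters, the small sizes in the demo and the use of Python's `random.Random` as the source of randomness are all assumptions.

```python
import random

def random_and_instance(n, m, d, rng):
    """Generate uniformly random Boolean inputs and a random AND of d literals.
    Returns (xs, y, relevant, negated) with xs[i][k] the value of x_i in tuple k."""
    relevant = rng.sample(range(n), d)                 # d distinct variables
    negated = [rng.random() < 0.5 for _ in range(d)]   # one of 2^d literal patterns
    xs = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n)]
    y = []
    for k in range(m):
        # a negated literal is true when the variable is 0, a plain one when it is 1
        literals = [(xs[i][k] == 0) if neg else (xs[i][k] == 1)
                    for i, neg in zip(relevant, negated)]
        y.append(int(all(literals)))
    return xs, y, relevant, negated

rng = random.Random(0)  # fixed seed so that the demo is reproducible
xs, y, relevant, negated = random_and_instance(n=50, m=200, d=3, rng=rng)
print(sorted(relevant), sum(y))
```

A trial would then count as successful when the inference algorithm, stopped after $d$ iterations, returns exactly the set `relevant`.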

Next, we made a computational experiment on general Boolean functions of
$d$ input variables. In this case, we choose a Boolean function $f_1(x_{i_1}, \ldots, x_{i_d})$
randomly from the $2^{2^d}$ possible Boolean functions. The following table shows the
result, where the percentage of the success ratio (%) and the average CPU time
(sec.) are given for each case.





              d = 2           d = 3           d = 4           d = 5
m = 100      50%  0.007      52%  0.017      32%  0.024       9%  0.035
m = 500      52%  0.111      77%  0.177      92%  0.238      89%  0.304
m = 1000     49%  0.278      76%  0.424      97%  0.576      98%  0.747
m = 2000     53%  0.612      77%  0.945      92%  1.238     100%  1.642



It is seen that for sufficiently large $m$, the success ratio increases as $d$
increases. This suggests that GREEDY1 can find the correct sets of variables not
only for AND/OR functions but also for most Boolean functions if $d$ is not
small. In the case of $d = 2$, the success ratio is around 50%. This is reasonable
because the number of AND/OR functions with two input variables is 8, whereas
there are 16 possible Boolean functions with two input variables.
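The count of 8 AND/OR functions among the 16 Boolean functions of two variables can be verified by enumerating truth tables. The sketch below is a simple check; the helper name `table` and the encoding of a function as a tuple of four output bits are assumptions made for the example.

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def table(f):
    """Truth table of a two-input Boolean function as a tuple of four bits."""
    return tuple(f(a, b) for a, b in inputs)

# All 16 Boolean functions of two variables, one per truth table.
all_functions = set(product([0, 1], repeat=4))

# AND of two literals and OR of two literals; na / nb say whether a literal is negated.
and_or = set()
for na, nb in product([0, 1], repeat=2):
    and_or.add(table(lambda a, b: (a ^ na) & (b ^ nb)))
    and_or.add(table(lambda a, b: (a ^ na) | (b ^ nb)))

print(len(and_or), len(all_functions))
```

The remaining eight functions are the two constants, the four functions depending on a single variable, and XOR and its negation.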

Next, we examined the noisy case for combinations of $d = 3, 5$, $\epsilon = 0.1, 0.2, 0.3$
and $m = 500, 1000, 2000$, where we used general Boolean functions. In this case,
we let $y_j(k) \ne f_j(x_{i_1}(k), \ldots, x_{i_d}(k))$ with probability $\epsilon$. The following table
shows the result, where only the percentage of the success ratio (%) is given for
each case.





             d = 3      d = 3      d = 3      d = 5      d = 5      d = 5
             ε = 0.1    ε = 0.2    ε = 0.3    ε = 0.1    ε = 0.2    ε = 0.3
m = 500       68%        71%        29%        74%        32%         0%
m = 1000      71%        72%        60%        89%        78%        12%
m = 2000      77%        73%        76%        98%        97%        72%



It is seen that the success ratio decreases as $\epsilon$ increases. But, for cases of
$\epsilon \le 0.2$ and $m = 2000$, the success ratios are close to those in the noiseless case.
Therefore, it seems that GREEDY1 works well even in the noisy environment if
a sufficiently large number of data are provided.

Finally, we examined the effect of $n$ on the CPU time, where we used $d = 3$,
$m = 1000$ and $\epsilon = 0$. The following table shows the result, where only CPU
times (sec.) are given.



n                 1000     2000     3000     4000     5000     6000
CPU time (sec.)   0.431    0.870    1.306    1.732    2.200    2.635



From this, it is seen that the CPU time is linear in $n$. Note that it took less
than 3 seconds even in the case of $n = 6000$ (and $m = 1000$).

The research group directed by the third author is now making biological
experiments on Yeast genes using several hundreds of mutants in which some
genes are modified. This corresponds to a case of $n \approx 6000$, $m \approx 1000$, and
$l \approx 6000$. Therefore, it is expected that it will take about $3 \times 6000 = 18000$ seconds,
i.e., about 5 hours (since it takes about 3 seconds per $y_j$), if GREEDY1 is applied to
real experimental data on Yeast genes. Now we are preparing to apply GREEDY1 to
analysis of Yeast genes since this computation time is not too long. Of course, real data
are much more complex than the artificial data used above. For example, instead of Boolean
values, real numbers are used in real data. However, measurement errors are
very large and thus real values should be classified into several discrete values.
We are now seeking appropriate threshold values for such classification.

Note that even in the average case, GREEDY1 sometimes fails to find the
correct set of variables. However, this is not a crucial disadvantage in real
applications. In practice, it is not required that all the relations found by the inference
algorithm are correct because correctness of the relations or hypotheses will be
checked in detail by an expert and/or by other experiments.
Fig. 3. Similarity and difference between the decision tree algorithms and GREEDY1
(panels: Decision Tree, GREEDY1).
Different attributes can be selected at the nodes of the same height in decision trees,
whereas the same attribute must be selected in GREEDY1.



7 Concluding Remarks 



In this paper, we have shown an efficient implementation of a simple greedy
algorithm for inferring functional relations. It is fast enough to be applied to large
data sets. We have also shown an average case analysis of the greedy algorithm,
though the analysis was restricted to a very special case. However, results of
computational experiments suggest that the greedy algorithm outputs correct
solutions in many average cases. Therefore, further studies on the average case
analysis might be possible and should be done. Of course, real data may be far from
average case input data because input attributes are not necessarily
independent. However, it is still guaranteed that GREEDY1 outputs an approximate
solution even in the worst case [3]. We are now trying to apply the greedy
algorithm to analysis of real experimental data on genetic networks. Although this
attempt is not yet successful, we successfully applied a similar greedy algorithm
to classification of cancer cells using gene expression data [5]. This fact suggests
that greedy type algorithms may be useful for analysis of real data.

Finally, we would like to mention the similarity between GREEDY1
and the algorithms for constructing decision trees [12]. In decision tree
algorithms, input data are partitioned into smaller blocks while descending the tree
from the root to the leaves, and a criterion such as the entropy score is used for
selecting the attribute at each node. Recall that in GREEDY1, input data are
partitioned into smaller blocks as the number of iterations increases, and the
attribute which covers the largest number of pairs is selected in each iteration.
Since the number of covered pairs can be considered as a kind of score, the most
important difference is that different attributes can be selected at different
nodes of the same height in the decision tree, whereas the same attribute must
be selected at all the nodes of the same height in GREEDY1 (see Fig. 3).
Therefore, decision tree algorithms could be used for finding functional relations if we
put a restriction on the decision tree that the same attribute must be
selected at all the nodes of the same height. Since various techniques have been
developed for decision trees, it would be interesting to apply these techniques
for improving GREEDY1.
Abstract. Many relations are represented by a graph structure,
e.g., chemical bonding, Web browsing records, DNA sequences, and inference
patterns (program traces), to name a few. Thus, efficiently finding
characteristic substructures in a graph will be a useful technique in many
important KDD/ML applications. However, graph pattern matching is
a hard problem. We propose a machine learning technique called
Graph-Based Induction (GBI) that efficiently extracts typical patterns from
graph data in an approximate manner by stepwise pair expansion
(pairwise chunking). It can handle general graph structured data, i.e.,
directed/undirected, colored/uncolored graphs with/without (self) loops
and with colored/uncolored links. We show that its time complexity is
almost linear in the size of the graph. We further show that GBI can
effectively be applied to the extraction of typical patterns from chemical
compound data from which to generate classification rules, and that
GBI also works as a feature construction component for other machine
learning tools.



1 Introduction 

Data having graph structure abound in many practical fields, such as molecular
structures of chemical compounds, information flow patterns in the internet,
DNA sequences and their 3D structures, and inference patterns (program traces of
reasoning processes). Thus, knowledge discovery from structured data is one of the
major research topics in recent data mining and machine learning study. The
approach proposed by Agrawal and Srikant to mine sequential patterns was one of
the initiating works in this field [Agrawal95]. Since then several approaches have
been proposed from different angles for sequential or structural data. Mannila et
al. proposed an approach to mine frequent episodes from sequences [Mannila97].
Shintani and Kitsuregawa devised a fast mining algorithm for sequential data
using parallel processing [Sintani98]. Srikant et al. used a taxonomy hierarchy as
background knowledge to mine association rules [Srikant97]. In this paper we
focus on mining typical patterns in graph structure data. By “typical” we
mean frequently appearing subgraphs in the whole graph data.

Conventional empirical inductive learning methods use an attribute-value
table as a data representation language and represent the relation between
attribute values and classes by use of a decision tree [Quinlan86] or induction
rules [Michalski90,Clark89]. Association rules [Agrawal94], widely used in data
mining, fall into this type of data representation. However, the attribute-value
table is not suitable for representing more general and structural data. Inductive
logic programming (ILP) [Muggleton89], which uses first-order predicate logic,
can represent general relationships in data. Further, ILP has the merit that it can
encode domain knowledge in the same representation language, and the acquired
knowledge can be added and utilized as background knowledge. However, it
still has a time complexity problem. We have explored a different approach called
GBI (Graph-Based Induction) by directly encoding the target relations in the form
of a general graph. Its expressiveness stands between the attribute-value table and
first-order predicate logic. GBI is similar to SUBDUE [Cook94], which also
tries to extract substructures in a graph. What differs most is that GBI can
find multiple patterns whereas SUBDUE can find only one substructure that
minimizes the description length of the total graph, using a computationally
constrained beam search with a capability of inexact match. GBI is much faster
and can handle much larger graphs.

Finding typical patterns from the whole graph involves graph matching as a
subproblem, which is known to be very hard [Fortin96]. The approach taken by
GBI is quite simple and is heuristic-based. It is based on the notion of pairwise
chunking, and no backtracking is made (thus it is approximate). Its time complexity is
almost linear in the size of the input graph. It can handle directed/undirected,
colored/uncolored graphs with/without (self) loops and with colored/uncolored
links. Some applications may require extracting patterns from a single huge
graph of millions of nodes, whereas others require extracting them from millions of small graphs
of several tens of nodes. GBI works for both situations. Some applications may
require only approximate solutions, whereas others require exact solutions. As is
evident, GBI gives only approximate solutions.

In the following sections, we describe the method of Graph-Based Induc- 
tion (section 2), discuss the time complexity of GBI from both theoretical and 
experimental points of view (section 3), show how GBI is applied to chemi- 
cal compound analyses (section 4), and conclude the paper by summarizing the 
main contribution. 

2 Graph-Based Induction 

2.1 Work in the Past 

The idea of pairwise chunking dates back several years [Yoshida99]. It was
originally proposed in the study of concept learning from graph structure data
[Yoshida95,Yoshida94] and the method was called GBI (Graph-Based
Induction). The central intuition behind it is that a pattern that appears frequently
enough is worth paying attention to and may represent an important concept
(which is implicitly embedded in the input graph). In other words, the repeated
patterns in the input graph represent typical characteristics of the given object
environment. The original GBI was formulated to minimize the graph size
by repeatedly contracting the graph, replacing each found pattern with one node.
The idea of pairwise chunking is shown in Figure 1. The graph size
definition reflected the sizes of the extracted patterns as well as the size of the contracted
graph. This prevented the algorithm from contracting continually, which meant
the graph never became a single node. To reduce the complexity of the search, the
ordering of links was constrained to be identical if two subgraphs were to match
(meaning that the notion of isomorphism was relaxed), and an opportunistic beam
search similar to a genetic algorithm was used to arrive at approximate solutions.
Because the primitive operation at each step in the search is to find a good set
of linked pair nodes to chunk (pairwise chunking), we later adopted an
indirect index rather than a direct estimate of the graph size to find the promising
pairs. The new method was shown to be effective by applying it to build an on-line
user adaptive interface [Motoda98]. However, in all of these, the type of graph
allowed is restricted to basically a tree (i.e., a node can only have one outgoing
link).




Fig. 1. The idea of graph contraction by pairwise chunking 



2.2 Representation of Graph by Table 

In order to apply GBI to general graph structured data by removing the
limitation on the type of graph and the constraint on the matching, we currently
represent graph structured data using a set of tables that record the link
information between nodes. For brevity we only explain the case of a directed
graph with unlabeled links, but a similar representation applies to a more
general graph. A directed graph such as the one shown in Figure 2 can be represented using
Table 1. For example, the first line in this table shows that node No. 1 has
node name “a” and also has nodes No. 7 and No. 10 as its child nodes. Thereby,
the restriction on the link ordering is no longer imposed.



Table 1. An example of table representation translated from a directed graph

Node No.   Node Name   Child Node No.
1          a           7 10
2          b           7
3          d           8 11
4          b           8
5          a           9
6          b           9
7          d           10
8          b           11 12
9          b           12
10         a           11
11         b           (none)
12         c           (none)



2.3 Basic Algorithm 

The stepwise pair expansion (pairwise chunking) is performed by repeating the
following three steps. The extracted chunks represent some characteristic
properties of the input data.

S1. If there are patterns identical to the chunked pattern in the input graph,
rewrite each of them to a single node of the same new label.

S2. Extract all linked pairs in the input graph.

S3. Select the most typical pair among the extracted pairs and register it as the
pattern to chunk.

Each time we perform the pairwise chunking, we keep track of link
information between nodes in order to be able to restore the original graph(s) or
represent the extracted patterns in terms of the original nodes. This is
realized by keeping two kinds of node information: “child node information” (which
node in the pattern the link goes to) and “parent node information” (which
node in the pattern the link comes from). These two kinds of information are
also represented by tables (not shown here). The chunking operation can be handled
by manipulating these three tables. The basic algorithm is shown in Figure 3.
Currently we use a simple “frequency” of pairs as the evaluation function
for stepwise pair expansion. Note that a self-loop distinction flag is necessary to
distinguish whether the parent and the child are the same node or not when
their labels are the same in counting the number of pairs. Further note that the
algorithm can naturally handle a graph with loop substructures inside.

Fig. 3. Algorithm of stepwise pair expansion
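Steps S2 and S3 can be illustrated on the graph of Table 1 with a short sketch: count every parent-to-child link by its pair of node names and pick the most frequent pair. This is only an illustration under assumptions, not the GBI implementation: the dictionary layout, the function names and the tie-breaking by label order are hypothetical, and the rewriting step S1 and the self-loop flag are omitted.

```python
from collections import Counter

# Table 1 as a dictionary: node number -> (node name, child node numbers).
graph = {
    1: ("a", [7, 10]), 2: ("b", [7]), 3: ("d", [8, 11]), 4: ("b", [8]),
    5: ("a", [9]), 6: ("b", [9]), 7: ("d", [10]), 8: ("b", [11, 12]),
    9: ("b", [12]), 10: ("a", [11]), 11: ("b", []), 12: ("c", []),
}

def count_linked_pairs(graph):
    """S2: count each link by its (parent name, child name) kind."""
    return Counter((name, graph[child][0])
                   for name, children in graph.values()
                   for child in children)

def most_typical_pair(graph):
    """S3 with plain frequency as the evaluation function.
    Ties are broken by label order (an assumption of this sketch)."""
    counts = count_linked_pairs(graph)
    return max(sorted(counts), key=counts.get), counts

pair, counts = most_typical_pair(graph)
print(pair, counts[pair])
```

On this graph the pair (b, b) occurs three times, although two of those occurrences share node 8, so an actual chunking step has to decide which non-overlapping occurrences to rewrite.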

In order to apply the algorithm to undirected graphs, undirected graphs are
converted to directed graphs by imposing a certain fixed order on node labels.
For example, by ordering node labels as $a \to b \to c$, the graph on the left in
Figure 4 is uniquely converted to the directed graph on the right.



Fig. 4. An example of conversion from undirected graph to directed graph
(example order of node labels: $a \to b \to c \to d$)
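The conversion from an undirected to a directed graph can be sketched as below. The sketch assumes that each link is directed from the node whose label comes earlier in the fixed order to the node whose label comes later; the text does not say how links between equally labeled nodes are oriented, so breaking such ties by node number is purely an assumption, as are the function name `orient_edges` and the toy graph.

```python
def orient_edges(edges, labels, order):
    """Direct each undirected edge from the earlier label to the later label.
    Ties between equal labels are broken by node number (an assumption)."""
    rank = {label: r for r, label in enumerate(order)}
    directed = []
    for u, v in edges:
        if (rank[labels[u]], u) <= (rank[labels[v]], v):
            directed.append((u, v))
        else:
            directed.append((v, u))
    return directed

# Toy undirected graph: node number -> label, and a list of undirected edges.
labels = {1: "a", 2: "b", 3: "c", 4: "b"}
edges = [(2, 1), (3, 2), (1, 3), (4, 2)]
directed = orient_edges(edges, labels, order="abcd")
print(directed)
```

Because the orientation depends only on the labels (and, in this sketch, node numbers), the same undirected pattern is always converted to the same directed pattern.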



3 Performance Evaluation 

Let N, L, P and C respectively denote the total number of nodes in the graph, the
average number of links going out of one node, the number of different kinds
of pairs in the graph, and the number of different kinds of chunked patterns derived
from the graph data. The time complexity of reading the input data, represented in
the table form shown in the previous section, is O(NL), because the program
must read all the link information in the graph. The time complexity of counting
the number of pairs of each kind is O(NL), because the program must search
all the links in the graph and the total number of links in the graph is NL. The
time complexity of selecting the pair to be chunked is O(P), because the program
must find the most frequent pair by scanning all the pair information. The time
complexity of performing the pairwise chunking is O(NL), because the program
must search all the links in the graph. The time complexity of updating the pair
information is O(P), because the program must search all kinds of pairs in the
graph. The program repeats the above process until the total number of chunked
patterns becomes C. Therefore, the total time complexity is O(NL + NL + C(P +
NL + P)) = O(CP + NL).
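
The two cheapest steps in this analysis, counting the pairs and selecting the most frequent one, can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the names `count_pairs` and `select_pair`, the representation of links as (parent, child) tuples, and the choice of (parent label, child label, self-loop flag) as the pair kind are illustrative and are not the authors' table-based implementation; link labels could be added to the key in the same way.

```python
from collections import Counter


def count_pairs(labels, links):
    """Count each kind of pair with one pass over all links, i.e. O(NL) work.

    labels: dict mapping node id -> node label
    links:  list of (parent, child) directed links
    """
    counts = Counter()
    for parent, child in links:
        # The flag separates a self-loop from a link between two distinct
        # nodes that happen to carry the same label.
        is_self_loop = parent == child
        counts[(labels[parent], labels[child], is_self_loop)] += 1
    return counts


def select_pair(counts):
    """Scan all P pair kinds and return the most frequent one, i.e. O(P) work."""
    return max(counts.items(), key=lambda item: item[1])


labels = {1: "a", 2: "b", 3: "a", 4: "b", 5: "a"}
links = [(1, 2), (3, 4), (2, 3), (5, 5)]
counts = count_pairs(labels, links)
print(select_pair(counts))  # (('a', 'b', False), 2)
```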

We have confirmed this experimentally. Figure 6 shows the computation time
(on a machine with a 400 MHz Pentium II CPU and 256 MB of memory) for
artificially generated graphs with random structure: the average number of outgoing
links is 3 or 5, and the size of the graph ranges from 100 to 10,000 nodes (see Figure 5 for example
graphs). Figure 7 shows the computation time for graphs with random structure
of fixed size (200 nodes): the outgoing link existence probability is changed from
10% to 100% and the number of node labels is changed from one to three. In all
cases chunking was terminated when the maximum number of pairs became
less than 4% of the total number of nodes in the graph.

As predicted, it is found that the computation time increases almost linearly 
with the size of the graph (number of nodes and links in the graph). It takes more 
time for the graphs with fewer node labels when the graph size is equal because 
the same patterns appear more often in the graphs with fewer node labels and 
there are more chances of pairwise chunking. It takes more time for the graphs 
with more links going out from each node when the graph size is equal because 
there are more candidate pairs to be chunked. 




[Fig. 5 panels: number of node labels: 1 and 5; average number of links from one node: 3 in both]



Fig. 5. Examples of graph data for experimental evaluation 








Fig. 6. Computation time vs. number of nodes (horizontal axis: number of nodes, 0 to 10,000)

Fig. 7. Computation time vs. number of links

4 Extracting Patterns from Chemical Compound Data

4.1 Application to Carcinogenicity Data 

Carcinogenesis prediction is one of the crucial problems in the chemical
control of our environment and in the industrial development of new chemical
compounds. However, experiments on living bodies and environments to
evaluate carcinogenesis are expensive and very time consuming, and
it is sometimes prohibitive, in both cost and efficiency, to rely solely on
experiments. It would be extremely useful if some of these
properties could be predicted from the structure of the chemical substances
before they are actually synthesized.

We explored the possibility of predicting chemical carcinogenicity using our
method. The task is to find structures typical of carcinogenic organic chlorides
comprising C, H and Cl. The data were taken from the National Toxicology Program
Database. We used the same small dataset that was used in [Matsumoto99],
in which typical attributes representing substructures of the substances were
symbolically extracted and used as inputs to a neural network to induce a
classifier. The data consist of 41 organic chlorides, of which 31 are carcinogenic
(positive examples) and 10 are non-carcinogenic (negative examples). There
are three kinds of links: single bonding, double bonding and aromatic bonding.
Several examples of carcinogenic organic chlorine compounds
are shown in Figure 8.

We treated carbon, chlorine and the benzene ring as distinct nodes in
the graphs and ignored hydrogen in our initial analyses. Further, we treated
the single bond, double bond, triple bond and bond between benzene rings as
links with different labels. Figure 9 shows an example of the conversion from
an organochlorine compound to graph structured data. Each compound is
associated with a LogP value (a measure of hydrophobicity). GBI cannot handle
this value, and thus we ignored it. Computation time was not an issue for GBI on
graphs of fewer than several tens of nodes; it ran in seconds.




Fig. 8. Examples of carcinogenic organic chlorine compounds

Fig. 9. Conversion to graph structured data



Figure 10 shows patterns extracted from the positive cases and those ex- 
tracted from the negative cases. By comparing these two sets of patterns, we 
can derive useful rules from the patterns which appear only in either positive 
patterns or negative patterns. A few examples are shown in Figure 11. 




Positive patterns 



Negative patterns 



Fig. 10. Example patterns extracted from positive and negative examples 



4.2 Application to Mutagenicity Data 

Some chemical compounds are known to cause frequent mutations, which are
structural alterations in DNA. Since there are so many chemical compounds,
it is impossible to obtain mutagenicity data for every compound from biological
experiments. Accurate evaluation of mutagenic activity from the chemical
structure (structure-activity relationships) is highly desirable. Furthermore, the
mechanism of mutation is extremely complex and known only in part. Some
evidence supports the existence of multiple mechanistic pathways for different
classes of chemical compounds. If this leads to a hypothesis for the key step in
the mechanisms of mutation, it will be very important in mutagenesis research.

[Fig. 11 panels. Carcinogenic: Support 22.0%, Confidence 90.0%; Support 12.1%, Confidence 83.3%. Non-carcinogenic: Support 2.4%, Confidence 100.0%]

Fig. 11. Example rules derived from chemical carcinogenesis data

The data were taken from [Debnath90]. The data set contains 230 aromatic
or heteroaromatic nitro compounds. Mutagenic activity was discretized into
four categories: Inactive: activity = −99, Low: −99 < activity < 0.0, Medium:
0.0 < activity < 3.0, High: 3.0 < activity. By this categorization, the
compounds are classified into 22 Inactive cases, 68 Low cases, 105 Medium cases and
35 High cases. The percentages of the High, Medium, Low and Inactive classes
are 15.2%, 45.7%, 29.5% and 9.6%, respectively. Each compound is associated
with two other features: LogP and LUMO (a property of electronic structure). We
ignored these two values as before. We treated the single bond, double bond,
triple bond and aromatic bond as different link labels, and carbon, chlorine and
nitrogen as distinct nodes, but ignored hydrogen as before. Furthermore,
artificial links are added from each node to the other nodes where the number
of links between the two is 2 to 6. This is to emulate variable (wildcard) nodes
and links.
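
The discretization of the activity value can be written as a small Python function. This is a sketch only: the function name `activity_class` is hypothetical, and the treatment of the boundary values 0.0 and 3.0, which the inequalities above leave open, is an assumption (each boundary value is placed in the upper class).

```python
def activity_class(activity):
    """Map a mutagenic activity value to one of the four classes."""
    if activity == -99:
        return "Inactive"
    if activity < 0.0:
        return "Low"
    if activity < 3.0:
        # Assumption: the boundary value 0.0 is counted as Medium.
        return "Medium"
    # Assumption: the boundary value 3.0 is counted as High.
    return "High"


print([activity_class(a) for a in (-99, -1.5, 0.0, 2.9, 3.0)])
# ['Inactive', 'Low', 'Medium', 'Medium', 'High']
```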

In this analysis, we used GBI as a feature construction tool as well as
a pattern extractor from which to generate classification rules. These patterns
and LogP and LUMO are used as a set of features for the decision tree
learner C4.5 [Quinlan93]. Each pattern found is evaluated and ranked by the
following measure (Eq. 1), which is the maximum relative class ratio of the four
classes. Here, i, l, m and h stand for the number of compounds of each class that
have this pattern as a subgraph, and I, L, M and H stand for the original number of
compounds of each class.




The top n patterns in the ranking (n = 10, 20, 30, 40, 50 with no artificial links and
n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 with artificial links) are used
as pattern features for C4.5.

Figures 12, 13, 14 and 15 show how the frequency count of the best pair
to chunk, the value of the evaluation measure of the chunked pattern, and its size
in terms of the number of nodes change as the chunking proceeds. As each of
the linked pair nodes has a different chunking history, the latter two are not
monotonic with chunking steps. Table 2 shows the prediction error of C4.5 for
various parameter settings. “All” means that all chosen chunked patterns and
both LogP and LUMO are used as the attributes. “Without cross validation
(w/o CV)” means that the error is measured on all the data from which the tree
is induced (training data). The table shows that the most decisive single parameter
is LogP (30.4%), but a well-chosen set of patterns used as attributes has about
the same effect as LogP (24.8%). Using all attributes reduces the error to 13.0% for w/o
CV. In all cases, the error of 10-fold cross validation is very large, and the problem
appears to be very hard. The predictive error of 10-fold cross validation using all the
chunked patterns by the method described in [Matsuda00] is 52.6%. In
this method, each pattern is converted to a rule using the same measure (Eq. 1)
to assign a class. Rules are applied in increasing order of their support, and if
the support is the same, rules with more complex patterns are applied first. The
result is consistent with Table 2. One possible reason for this poor predictive
capability is that the patterns are more descriptive than discriminative. This is
not because the method is not working well, but because no classifier can achieve
a better predictive error. This means that the attributes we have prepared
are not sufficient, and we need some other features (e.g., 3-D structural
information) to improve classification power. Some examples of the extracted
rules are shown in Figure 16. We could not find a wildcard expression that gives a
meaningful rule. This is because we neglected hydrogen.




Fig. 12. Max. no. of pairs and class predictive evaluation measure vs. chunk steps without artificial links

Fig. 13. Chunk size (no. of nodes) vs. chunk steps without artificial links

Fig. 14. Max. no. of pairs and class predictive evaluation measure vs. chunk steps with artificial links

Fig. 15. Chunk size (no. of nodes) vs. chunk steps with artificial links

Table 2. Prediction error (%) by C4.5 with and without constructed pattern attributes,
LogP and LUMO for mutagenicity data: upper (without artificial links), lower (with
artificial links)

Without artificial links:

No. of   all              w/o LUMO         w/o LogP         patt. only
attr.    w/o cv  10 fcv   w/o cv  10 fcv   w/o cv  10 fcv   w/o cv  10 fcv
10       15.7    44.8     20.9    43.9     25.7    47.4     40.4    44.8
20       14.8    45.6     18.3    43.0     23.0    50.4     35.7    45.6
30       13.9    44.8     17.8    41.3     21.3    49.6     35.7    47.4
40       13.9    39.6     20.0    41.3     20.4    51.8     30.9    49.1
50       15.2    47.0     19.6    47.0     20.9    47.8     29.6    49.1

Settings without pattern attributes:

             w/o cv  10 fcv
w/o patt.    27.8    40.9
only LUMO    44.8    47.8
only LogP    30.4    47.4

With artificial links:

No. of   all              w/o LUMO         w/o LogP         patt. only
attr.    w/o cv  10 fcv   w/o cv  10 fcv   w/o cv  10 fcv   w/o cv  10 fcv
10       21.3    44.8     29.1    52.2     30.9    46.5     48.3    51.3
20       17.0    41.7     23.9    49.1     25.7    47.0     38.2    55.23
30       18.3    48.7     21.3    52.6     22.2    48.7     33.9    48.7
40       17.4    49.1     22.2    52.6     23.5    50.9     31.7    50.9
50       14.3    47.4     20.4    51.7     17.8    48.7     30.4    50.9
60       13.0    47.4     19.6    53.1     16.5    46.5     28.3    52.2
70       13.5    43.9     20.4    49.6     18.3    44.8     28.3    47.4
80       13.5    50.4     18.3    45.6     17.0    50.4     26.1    46.1
90       14.8    49.1     18.7    47.8     17.8    50.0     26.1    45.2
100      14.3    46.1     18.3    46.9     17.8    46.9     24.8    45.2



[Fig. 16 panels: four example patterns (graphics not reproduced), each annotated with per-class support (sup) and confidence (conf); patterns are numbered here in order of appearance]

Pattern   Inactive        Low             Medium          High
          sup    conf     sup    conf     sup    conf     sup    conf
1         0.9%   40%      1.3%   60%      0.0%   0%       0.0%   0%
2         0.0%   0%       2.2%   100%     0.0%   0%       0.0%   0%
3         0.0%   0%       0.0%   0%       3.4%   100%     0.0%   0%
4         0.0%   0%       0.0%   0%       0.4%   25%      1.3%   75%

Fig. 16. Example rules derived from chemical mutagenicity data



5 Conclusion 

In this paper, we showed how the capability of the Graph-Based
Induction algorithm can be expanded to handle more general graphs, i.e., directed graphs with
1) multiple-input/output nodes and 2) loop structures (including self-loops).
The time complexity of the implemented program was evaluated from both
theoretical and experimental points of view, and it was shown that the running
time grows almost linearly with the graph size. The algorithm was applied to two kinds of
chemical compound data (carcinogenicity data and mutagenicity data) in order
to extract useful patterns. It was also shown that GBI can be effectively used
as a feature construction component for other machine learning methods, which
enables combined use of non-structural and structural features.

Ongoing and future work includes 1) deciding on a rational evaluation method
for rules derived from chemical compound data, 2) investigating the sensitivity
to chunk ordering, 3) using a statistical index (e.g., the Gini index [Breiman84]) or
the description length (e.g., [Cook94]) instead of the simple “frequency” as the
evaluation function for stepwise expansion, and 4) introducing a new index that
corresponds to the human notion of “similarity.”
Abstract. We attempt to extract characteristic expressions from literary
works. That is, given literary works by a particular
writer as positive examples and works by another writer as negative examples,
our problem is to find expressions that appear frequently in the positive examples
but not in the negative examples. This is considered a special case
of optimal pattern discovery from textual data, in which only
substring patterns are considered. One reasonable approach is to create
a list of substrings arranged in descending order of their goodness,
and to have a human expert examine the first part of the list. Since there
is no word boundary in Japanese texts, a substring is often a fragment
of a word or a phrase. How to assist the human expert is a key to success
in discovery. In this paper, we propose (1) restricting the list to the prime
substrings in order to remove redundancy, and (2) a way
of browsing the neighborhood of a focused string as well as its context. Using
this method, we report successful results on two pairs of anthologies
of classical Japanese poems. We expect that the extracted expressions
may lead to the discovery of overlooked aspects of individual poets.



1 Introduction 

Analysis of expressions is one of the most fundamental methods in literary stud- 
ies. In classical Japanese poetry Waka, there are strict rules in the choice and 
combination of poetic words. For instance, the word “Uguisu” (Japanese bush 
warbler) should be used linked with the word “Ume” (plum-blossom). It is sig- 
nificant in Waka studies, therefore, to consider how poets learned such rules and 
developed their own expressions. We would like to investigate how they learned 
certain expressions from their predecessors. 

From this point of view, we have established a method that automatically
extracts poems with similar expressions from a Waka database [12,11]. Using this
method, we discovered affinities between some previously unheeded poems and earlier
ones. This raised an interesting issue for Waka studies, and we could give
a convincing conclusion to it.

- We have shown that the poem (As if it were in darkness, / My parental heart
  / Is blind and lost in / The ways of caring about my child) by Fujiwara-no-Kanesuke,
  one of the renowned thirty-six poets, was in fact based on a model
  poem found in Kokinshu. The same poem had been interpreted merely as showing
  “frank utterance of parents’ care for their child.” By extracting the structure
  shared by the two poems, our study revealed the poet’s techniques of
  composition, half hidden by the heart-warming character of the poem.

- We have compared Tametadashu, a mysterious anthology unidentified in
  Japanese literary history, with a number of private anthologies edited after
  the middle of the Kamakura period (the thirteenth century) using the same
  method, and found that there are about 10 pairs of similar poems between
  Tametadashu and Sokonshu, an anthology by Shōtetsu. The result suggests
  that the mysterious anthology was edited by a poet in the early Muromachi
  period (the fifteenth century). The editing date has been disputed since one
  scholar suggested the middle of the Kamakura period as
  a probable date. Our result provides strong evidence on this problem.

While continuing to find affinities among Waka poems, it is also necessary 
to give some additional conditions to the method when we compare poets in 
parent-child or teacher-student relationships. It is easily inferred that poet A 
(the master poet) greatly influences poet B (the disciple poet), and that the 
poems by poet B may have a lot of allusions to those by poet A. In fact, many 
scholars have already noticed such apparent literary relationships. In such cases, 
it is much more important to clarify the differences than to enumerate affinities. 
For example, when poet B hardly adopts the expressions frequently used by 
poet A, it will give us a chance to study their relationship in a different way. 
In this paper, we will compare two poets’ private anthologies and try to make 
clear the differences and features in the similar expressions on the basis of their 
frequencies. This is in fact derived from the same methodology we have been 
practicing so far. 

Shimozono et al. [10] formulated the problem of finding a good pattern that
distinguishes two sets Pos and Neg of strings (called positive and negative
examples, respectively) as an instance of optimized pattern discovery, originally
proposed by Fukuda et al. [6]. The problem is:

Given. Two finite sets Pos and Neg of non-empty strings.

Find. A pattern π that minimizes a statistical measure function G(π; Pos, Neg).

For instance, the classification error, the information entropy, the Gini index,
and the χ² index are used as a statistical measure function G. Note that these
measures are functions of the 4-tuple (p1, n1, p0, n0), where p1 and n1 are the
numbers of strings in Pos and Neg that contain π, respectively, and p0 = |Pos| − p1
and n0 = |Neg| − n1. Namely, the goodness of a pattern depends only on its
frequencies in Pos and Neg. Shimozono et al. [10] showed an efficient algorithm
for proximity word-association patterns. Although only the classification error
measure is dealt with in [10], the algorithm works for many other measures [1].

Our problem is regarded as a special case of this problem. That is, we deal
only with patterns of the form *w*, where * is a wildcard that matches any
string and w is a non-empty string. We call such patterns the substring patterns.
We have a trivial O(m) time and space algorithm, since there exist essentially
O(m) candidates for the best substring w, where m is the total length of the
strings in S = Pos ∪ Neg; the difficulty therefore lies mainly in how to find an
appropriate goodness measure.
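
To make the role of the 4-tuple concrete, the following Python sketch computes (p1, n1, p0, n0) for a substring pattern and evaluates two measures from it. The function names are hypothetical, and the formulas are the common textbook forms of the classification error and the Gini index for a two-way split; the text above does not spell out the exact definitions used in [10] or [1], so these are assumptions.

```python
def pattern_counts(w, pos, neg):
    """Return (p1, n1, p0, n0) for the substring pattern *w*."""
    p1 = sum(w in s for s in pos)  # positive strings that contain w
    n1 = sum(w in s for s in neg)  # negative strings that contain w
    return p1, n1, len(pos) - p1, len(neg) - n1


def classification_error(p1, n1, p0, n0):
    """Error rate when each side of the split predicts its majority class."""
    return (min(p1, n1) + min(p0, n0)) / (p1 + n1 + p0 + n0)


def gini_index(p1, n1, p0, n0):
    """Size-weighted Gini impurity of the two sides of the split."""
    def impurity(p, n):
        if p + n == 0:
            return 0.0
        q = p / (p + n)
        return 2.0 * q * (1.0 - q)

    total = p1 + n1 + p0 + n0
    return ((p1 + n1) * impurity(p1, n1) + (p0 + n0) * impurity(p0, n0)) / total


pos = ["abab", "abba", "bab"]
neg = ["bbb", "cab", "ccc"]
for w in ("ba", "ab"):
    counts = pattern_counts(w, pos, neg)
    print(w, counts, classification_error(*counts), gini_index(*counts))
```

In this toy data the substring "ba" occurs in every positive string and in no negative string, so both measures are zero for it, while "ab" also occurs in one negative string and scores worse under both measures.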

However, it seems that no matter what measure we use, most of the ‘good’ patterns
are not so good in practice; they are obvious and worthless in many cases.
For this reason, discovery requires an effort by domain experts to examine the
upper part of the list of patterns arranged in decreasing order of goodness.
This corresponds to the step of interpreting mined patterns, a ‘postprocessing’ of
data mining in the knowledge discovery process [5]. We believe that how to support
the domain experts in this step is a key to success. In this paper we tackle this
problem with the weapon of stringology.

Let Sub(S) be the set of substrings of strings in S. All the strings in Sub(S)
could be candidates for characteristic expressions. The candidate strings, however,
are not independent of each other in the sense that some strings subsume
other ones. Moreover, we are frequently faced with the case that two strings in
the superstring-substring relation have the same frequency, and therefore have
the same value of goodness. (Recall that the goodness measures considered in this
paper depend only on the frequencies in positive and negative examples.) For
instance, in the first eight imperial anthologies (from Kokinshu to Shinkokinshu),
every occurrence of the string “shi-no-no-ya” (consisting of 4 syllables) is a
substring of “yo-SHI-no-no-ya-ma” (6 syllables; Mt. Yoshino). In such a case we
want to remove the shorter string.

On the other hand, researchers are interested not just in the frequency of a
word but in its actual use. That is, they want to access the context of each
appearance of the word. In addition, a candidate string is often a fragment of a word or a
phrase and seems to be meaningless. In order to find the word or the phrase that
contains this fragment as a substring, the researchers need to check what string
immediately precedes (follows) this fragment, for every occurrence of it. Thus,
it is necessary for the researchers to see the possible superstrings of a focused
candidate string.

In order to introduce a structure into Sub(S), we use the equivalence relation,
first defined by Blumer et al. [2], which has the following properties:

- Each equivalence class has a unique longest string which contains any other
  member as a substring, and we regard it as the representative of the class.

- All the strings in an equivalence class have the same frequency in S (and
  therefore have the same value of goodness).

- The number of equivalence classes is linear with respect to the total length
  of the strings in S.

A string in Sub(S) is said to be prime if it is the representative of some
equivalence class under this equivalence relation. The basic idea is to consider only
the prime substrings as candidates for characteristic expressions. Even though
we remove all the non-prime substrings from the list, any non-prime
substring with a high goodness value can still be found as a part of the prime
substring equivalent to it.

It should be stated that the data structure called the suffix tree [3] exploits
a similar and more popular equivalence relation, which also satisfies the above
three conditions. This equivalence relation is finer than the one by Blumer et al.
[2], and therefore includes many more equivalence classes. From the viewpoint
of computational complexity, this is not a significant difference because
both equivalence relations satisfy the third condition. However, the difference is
crucial for researchers who must check the candidate strings. In fact, the number
of equivalence classes was reduced to approximately 1/4 by using the one by
Blumer et al. in our experiment using Waka poems, as will be shown in Section 5.

We create a candidate list consisting only of prime substrings. To support the
domain experts who inspect the list, we develop a kind of browser for viewing:

- The non-prime substrings equivalent to each prime substring.

- The superstring-substring relationships between the prime strings.

The symmetric compact DAWG [2] for S is useful for this purpose. It is a directed 
acyclic graph such that the vertices are the prime substrings, and the labeled 
edges represent: What happens when appending a possible letter to the left or right 
end of a prime substring? For instance, in the first eight imperial anthologies, 
appending a single letter “ya” to the right end of the string “yo-SHI-no-no” 
yields “yo-SHI-no-no-ya-ma.” In the case of another letter “ha,” it yields the 
string “mi-yo-SHI-no-no-ha-na.” (Notice that not only “ha-na” is appended 
to the right, but also “mi” is appended to the left.) On the other hand, the result 
of appending “mi” to the left of the same string is simply “mi-yo-SHI-no-no.” 
One could consider interactively drawing a subgraph of the symmetric compact
DAWG whose nodes are limited to those reachable from or to a focused
node (namely, the substrings or superstrings of the focused string). This
subgraph, however, is rather complicated for an S of realistic size.
For this reason, we shall instead draw:

- The subgraph consisting of the prime substrings that are substrings of the
  focused one. (It is small enough because the focused string is rather short in
  practice.)

- The right and left context trees, which represent the same information as
  the subgraph consisting of the prime substrings that are superstrings of the
  focused one.

The right context tree of a prime substring x is essentially the same as the subtree
of the node x in the suffix tree [3] for S, but augmented by adding to every node
xy a label of “γ[x]y” such that γxy is the prime substring that is equivalent to
xy. The left context tree is defined in a similar way.

Most of the existing text analysis tools have the KWIC (Key Word In Context)
display [8]. The left and the right context trees of a focused prime substring
enable researchers to grasp the context of the string more quickly than
KWIC, especially when the number of appearances of the string being checked
is relatively large.

Using the suggested method, we extracted the strings that differentiate two
anthologies, scrutinized them from a scholarly standpoint, and succeeded in finding
the characteristic expressions. We compared Sankashu by Saigyō with Shugyokushu by Jien,
and Shuigusō by Fujiwara-no-Teika with Tameieshu by his son, Fujiwara-no-Tameie,
and obtained expressions highlighting the differences between
the two anthologies in each pair.

Saigyō and Jien were both priests, but their lives contrasted sharply in status
and social circumstances. Being a son of the famous regent, Jien could not make
a break with the government. However, it is well known that Jien sought much
help from Saigyō in his last days, not only about poetic composition but also
about the way of life. For instance, he confessed his plan of becoming a hermit.

On the other hand, Tameie was rigorously trained by his own father, Teika, 
for he was in the direct descent of the Mikohidari dynasty famous for producing 
poets. The number of Teika’s poems Tameie referred to is not small. 

In each pair of anthologies, there necessarily exist similar poems. However, as 
we have demonstrated so far, we could collect certain differences in expressions. 
This, we expect, will possibly lead to discovering overlooked aspects of individual 
poets. 

It may be relevant to mention that this work is a multidisciplinary study
between literature and computer science. In fact, the second-to-last author
is a Waka researcher and the last author is a linguist of the Japanese
language.

2 Substring Statistics for Japanese Literary Studies 

To use word statistics, we need to perform word segmentation, because texts
written in Japanese, as in some other non-Western languages, contain no word
boundaries. This is a difficult and time-consuming task. For example,
it is reported in [9] that it took eight years to build a part-of-speech tagged
corpus of “the Tale of Genji.” In the case of Waka poetry, building such corpora
is more difficult because of the frequently used technique called “Kake-kotoba,”¹
which exploits the ambiguities of a word or part of a word. We can say that a
unique word segmentation is essentially impossible in such a situation.

Recently some researchers noticed that using substring statistics instead of
word statistics is in fact useful in expression analysis, although a substring
could be a fragment of a word or a phrase. For example, Kondo [7] proposed
an expression analysis method based on n-gram statistics, and reported some
differences between the expressions used by male and female poets in Kokinshu, the
most famous imperial anthology. The n-gram statistics is essentially the same as
the substring statistics since n cannot be fixed. This work opened the door for
the application of substring statistics to Japanese literary works.

¹ A sort of homonymic punning where the double meaning of a word or part of a word is
exploited. In English it is usually called the pivot word because it is used as a pivot
between two series of sounds with overlapping syntactical and semantic patterns.

However, to ease the burden, Kondo restricted herself to the substrings that (1) are of length
from 3 to 7, (2) occur more than once, and (3) are used only by male poets (not
by female poets). In this paper, we will show that such a restriction
can be removed by exploiting combinatorial properties of strings.



3 Prime Substrings 

In this section, we give a formal definition of the prime substrings, and present 
some of their properties. 



3.1 Preliminary 

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. Strings $x$, $y$, and
$z$ are said to be a prefix, substring, and suffix of the string $u = xyz$, respectively.
A string $u$ is said to be a superstring of a string $y$ if $y$ is a substring of $u$. The
length of a string $u$ is denoted by $|u|$. The empty string is denoted by $\varepsilon$, that
is, $|\varepsilon| = 0$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$. The $i$th symbol of a string $u$ is denoted by $u[i]$
for $1 \le i \le |u|$, and the substring of a string $u$ that begins at position $i$ and
ends at position $j$ is denoted by $u[i : j]$ for $1 \le i \le j \le |u|$. For convenience,
let $u[i : j] = \varepsilon$ for $j < i$. Let $\mathit{Sub}(w)$ denote the set of substrings of $w$. Let
$\mathit{Sub}(S) = \bigcup_{w \in S} \mathit{Sub}(w)$ for a set $S$ of strings. For a set $S$ of strings, denote by
$\|S\|$ the total length of the strings in $S$, and denote by $|S|$ the cardinality of $S$.



3.2 Definition of Prime Substrings 

Definition 1. Let $S$ be a non-empty finite subset of $\Sigma^+$. For any $x$ in $\mathit{Sub}(S)$,
let

$$\mathit{Beginpos}_S(x) = \{(w, j) \mid w \in S,\ 0 \le j \le |w|,\ x = w[j + 1 : j + |x|]\},$$
$$\mathit{Endpos}_S(x) = \{(w, j) \mid w \in S,\ 0 \le j \le |w|,\ x = w[j - |x| + 1 : j]\}.$$

For any $x \notin \mathit{Sub}(S)$, let $\mathit{Beginpos}_S(x) = \mathit{Endpos}_S(x) = \emptyset$.

For example, if $S = \{babbc, ababb\}$, then $\mathit{Beginpos}_S(a) = \mathit{Beginpos}_S(ab) =
\{(babbc, 1), (ababb, 0), (ababb, 2)\}$, $\mathit{Beginpos}_S(c) = \{(babbc, 4)\}$, and $\mathit{Endpos}_S(bb) =
\mathit{Endpos}_S(abb) = \mathit{Endpos}_S(babb) = \{(babbc, 4), (ababb, 5)\}$.
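
The example can be checked mechanically. The following Python sketch is a direct, brute-force transcription of Definition 1 (the function names `beginpos` and `endpos` and the use of a Python list for S are choices made for this illustration); Python's 0-based slices `w[j:j + len(x)]` and `w[j - len(x):j]` correspond to the 1-based ranges used in the definition.

```python
def beginpos(S, x):
    """Pairs (w, j) such that x occupies positions j+1 .. j+|x| of w (1-based)."""
    return {(w, j) for w in S for j in range(len(w) + 1) if w[j:j + len(x)] == x}


def endpos(S, x):
    """Pairs (w, j) such that x occupies positions j-|x|+1 .. j of w (1-based)."""
    return {(w, j) for w in S for j in range(len(x), len(w) + 1) if w[j - len(x):j] == x}


S = ["babbc", "ababb"]
print(sorted(beginpos(S, "a")))               # [('ababb', 0), ('ababb', 2), ('babbc', 1)]
print(beginpos(S, "a") == beginpos(S, "ab"))  # True
print(sorted(endpos(S, "bb")))                # [('ababb', 5), ('babbc', 4)]
```

A string that does not occur in S yields the empty set from both functions, matching the convention for strings outside Sub(S).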

From here on, we omit the set S, and write simply as Beginpos and Endpos. 

Definition 2. Let $x$ and $y$ be any strings in $\Sigma^*$. We write $x \equiv_L y$ if $\mathit{Beginpos}(x) =
\mathit{Beginpos}(y)$, and write $x \equiv_R y$ if $\mathit{Endpos}(x) = \mathit{Endpos}(y)$. The equivalence
class of a string $x \in \Sigma^*$ with respect to $\equiv_L$ (resp. $\equiv_R$) is denoted by $[x]_{\equiv_L}$ (resp.
$[x]_{\equiv_R}$).

If $S = \{babbc, ababb\}$, then $[\varepsilon]_{\equiv_L} = [\varepsilon]_{\equiv_R} = \{\varepsilon\}$, $[a]_{\equiv_L} = \{a, ab\}$, $[bb]_{\equiv_R} =
\{bb, abb, babb\}$, and $[c]_{\equiv_R} = \{c, bc, bbc, abbc, babbc\}$.

Note that all strings that are not in $\mathit{Sub}(S)$ form one equivalence class under
$\equiv_L$ ($\equiv_R$). This equivalence class is called the degenerate class. All other classes are
called nondegenerate.

It follows from the definition of =l that, if x and y are strings in the same 
nondegenerate class under =l, then either x is a suffix of y, or vice versa. There- 
fore, each nondegenerate equivalence class in has a unique longest member. 
Similar discussion holds for 

Definition 3. For any string x in Sub{S), let it and tc denote the unique 
longest members of[x]=^ and [x]=j,, respectively. 

For any string x in Sub{S), there uniquely exist strings a and (3 such that 
tic = ax and it = xj3. In the running example, we have ^ = e, it = ab, 

b = b, %b = babb, bm = babb, and Tt = babbc. 

Figure 1 shows the suffix tree [3] for S = {babbc, ababb}. Note that the nodes
of the suffix tree are the strings x with $x = \overrightarrow{x}$. On the other hand, Fig. 2 shows




Fig. 1. Suffix tree for S = {babbc, ababb}.



the directed acyclic word graph (DAWG for short) [3] for S. Note that the nodes
are the nondegenerate equivalence classes in $\equiv_R$. The DAWG is the smallest
automaton that recognizes the set of suffixes of the strings of S if we designate
some nodes as final states appropriately.

Definition 4. For any string x in Sub(S), let $\overleftrightarrow{x}$ be the string $\alpha x \beta$ such that
$\alpha$ and $\beta$ are the strings satisfying $\overleftarrow{x} = \alpha x$ and $\overrightarrow{x} = x\beta$.

In the running example, $\overleftrightarrow{\varepsilon} = \varepsilon$, $\overleftrightarrow{a} = ab$, $\overleftrightarrow{b} = b$, $\overleftrightarrow{ab} = ab$, $\overleftrightarrow{abb} = babb$,
$\overleftrightarrow{bb} = babb$, and $\overleftrightarrow{c} = babbc$.

Definition 5. Strings x and y are said to be equivalent on S if and only if

1. $x \notin Sub(S)$ and $y \notin Sub(S)$, or

2. $x, y \in Sub(S)$ and $\overleftrightarrow{x} = \overleftrightarrow{y}$.

This equivalence relation is denoted by $x \equiv y$. The equivalence class of x under
$\equiv$ is denoted by $[x]_{\equiv}$.

Fig. 2. DAWG for S = {babbc, ababb}.

Notice that, for any string x in Sub(S), the string $\overleftrightarrow{x}$ is the longest member of
$[x]_{\equiv}$. Intuitively, $\overleftrightarrow{x} = \alpha x \beta$ means that:

— Every time x occurs in S, it is preceded by $\alpha$ and followed by $\beta$.

— Strings $\alpha$ and $\beta$ are as long as possible.

Now, we are ready to define the prime substrings.

Definition 6. A string x in Sub(S) is said to be prime if $\overleftrightarrow{x} = x$.

Lemma 1 (Blumer et al. (1987)). The equivalence relation $\equiv$ is the transitive
closure of the relation $\equiv_R \cup \equiv_L$.

It follows from the above lemma that $\overleftrightarrow{x} = \overleftarrow{(\overrightarrow{x})} = \overrightarrow{(\overleftarrow{x})}$ for any string x in
Sub(S).
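To make Definitions 3 to 6 concrete, the sketch below computes the longest members of the two classes and the set of prime substrings by exhaustive search over Sub(S). It is only an illustration under our own naming (`substrings`, `closure`, `prime_substrings` are hypothetical helpers); it is quadratic or worse and is not an efficient construction.

```python
def substrings(S):
    """Sub(S), including the empty string."""
    subs = {""}
    for w in S:
        for i in range(len(w)):
            for j in range(i + 1, len(w) + 1):
                subs.add(w[i:j])
    return subs


def beginpos(S, x):
    return {(w, j) for w in S
            for j in range(len(w) - len(x) + 1) if w[j:j + len(x)] == x}


def endpos(S, x):
    return {(w, j) for w in S
            for j in range(len(x), len(w) + 1) if w[j - len(x):j] == x}


def longest_equivalent(S, x, pos):
    """Longest substring with the same position set as x (x must be in Sub(S))."""
    target = pos(S, x)
    return max((y for y in substrings(S) if pos(S, y) == target), key=len)


def closure(S, x):
    """The string alpha + x + beta of Definition 4."""
    right = longest_equivalent(S, x, beginpos)  # equals x + beta
    left = longest_equivalent(S, x, endpos)     # equals alpha + x
    alpha = left[:len(left) - len(x)]
    beta = right[len(x):]
    return alpha + x + beta


def prime_substrings(S):
    """Substrings that coincide with their own closure (Definition 6)."""
    return {x for x in substrings(S) if closure(S, x) == x}


S = {"babbc", "ababb"}
print(sorted(prime_substrings(S), key=len))
```

For the running example this search reports six prime substrings, including the empty string, out of the seventeen members of Sub(S).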

3.3 Properties of Prime Substrings 

Recall that the number of all substrings of S is $O(\|S\|^2)$. This can be reduced to
$O(\|S\|)$ by considering only the substrings x such that $\overrightarrow{x} = x$. In fact, the suffix tree
is a data structure that exploits this property. Similarly, the DAWG achieves its
$O(\|S\|)$ space complexity by identifying every substring x with the substring $\overleftarrow{x}$.
Since the number of prime substrings of S is also $O(\|S\|)$, it may seem that there is
no advantage in considering only the prime substrings. In practical applications,
however, there is a big advantage because the users do not have to examine
non-prime substrings. The next lemma gives tighter bounds.

Lemma 2 (Blumer et al. (1987)). Assume $\|S\| > 1$. The number of
nondegenerate equivalence classes in $\equiv_L$ ($\equiv_R$) is at most $2\|S\| - 1$. The number
of nondegenerate equivalence classes in $\equiv$ is at most $\|S\| + |S|$.

Thus, the number of prime substrings of S is at most $\|S\| + |S|$. In practice,
the number of prime substrings is usually smaller than both the number of
substrings x with $\overrightarrow{x} = x$ and the number of substrings x with $\overleftarrow{x} = x$, which
are upper-bounded by $2\|S\| - 1$.

Let Prime(S) be the set of prime substrings of S, i.e., $Prime(S) = \{\overleftrightarrow{x} \mid x \in Sub(S)\}$.

Definition 7. The symmetric compact DAWG for S is the triple $(V, E_L, E_R)$
where V = Prime(S) is the set of vertices, and $E_L, E_R \subseteq V \times V \times \Sigma^+$ are two
kinds of labeled edges defined by:

$E_L = \{(x, \gamma\sigma x\delta, \gamma\sigma) \mid x \in V,\ \sigma \in \Sigma,\ \gamma, \delta \in \Sigma^*,\ \gamma\sigma x\delta = \overleftrightarrow{\sigma x}\}$,

$E_R = \{(x, \delta x\sigma\gamma, \sigma\gamma) \mid x \in V,\ \sigma \in \Sigma,\ \delta, \gamma \in \Sigma^*,\ \delta x\sigma\gamma = \overleftrightarrow{x\sigma}\}$.

The compact DAWG for S is the graph $(V, E_R)$.

Figure 3 shows the compact DAWG for the running example. Compared with
the suffix tree of Fig. 1 and the DAWG of Fig. 2, the nodes represent only
the prime substrings, and therefore the number of nodes is expected to be
smaller than in the suffix tree and the DAWG.

Figure 4 shows the symmetric compact DAWG for the running example. 




Fig. 3. Compact DAWG for S = {babbc, ababb}. The edge labeled by $\sigma\gamma$ from the node
x to the node y corresponds to the fact that $x\sigma$ is equivalent to y under $\equiv$, where
$x, y \in Prime(S)$, $\sigma \in \Sigma$, $\gamma \in \Sigma^*$. For instance, the arrow labeled “abb” from the node
“b” to the node “babb” means that ba is equivalent to babb under $\equiv$.




Fig. 4. Symmetric compact DAWG for S = {babbc, ababb}. The solid and the broken
arrows represent the edges in $E_R$ and $E_L$, respectively.



Lemma 3 (Blumer et al. (1987)). Both the compact DAWG and the symmetric
compact DAWG for S can be constructed in $O(\|S\|)$ time and space.



4 Browser for Finding Characteristic Substrings

We would find ‘good’ patterns in text strings in the following manner.

1. Choose an appropriate measure of the goodness of the patterns which
depends only on their frequencies.

2. Compute the goodness of all possible patterns, and create a list of patterns
arranged in decreasing order.

3. Have a domain expert evaluate the upper part of the list.

However, most of the patterns in the upper part of the list are not so good in
practice. They are obvious and/or worthless in many cases. For this reason, it is
most important to develop an effective way of supporting the domain expert in
the third step.

The patterns we are dealing with are restricted to substring patterns,
which are of the form *w*, where * is the wildcard that matches any string in
$\Sigma^*$ and w is a non-empty string in $\Sigma^+$. Since we assume that the goodness of a
substring pattern depends only on its frequencies, we can restrict w to a prime
substring. Thus, we exclude the non-prime substrings from the list of candidates
for characteristic expressions. There is no risk of overlooking any good non-prime
substring x, because x must appear as a substring of the prime substring $\overleftrightarrow{x}$, which
has the same value of goodness.

To support the expert, we develop a browser that shows the following information.

— Among the substrings of the focused string, which are equivalent to it?

— Among the prime substrings, which are superstrings (substrings) of the
focused string?

The symmetric compact DAWG is useful for this purpose. That is, the strings
equivalent to a focused prime substring x can be obtained from the incoming
edges of the node x. If $(x', x, \sigma\gamma) \in E_R$, then any string z such that $x'\sigma \preceq z \preceq x$
is equivalent to x, where $u \preceq v$ means that u is a substring of v. Similarly, if
$(x', x, \gamma\sigma) \in E_L$, any string z such that $\sigma x' \preceq z \preceq x$ is equivalent to x. The
string $x'\sigma$ ($\sigma x'$) appears exactly once within the string x, and so it is easy to
grasp the strings z satisfying the inequality.

To give the domain expert an illustration of the superstring-substring
relation on the prime substrings, we could interactively draw the
subgraph of the symmetric compact DAWG in which the nodes are restricted to
the ones reachable from or to a focused node. However, this subgraph is still large
and complicated for a real-world set S of strings of reasonable size. Instead,
we draw

— The subgraph consisting of the prime substrings that are substrings of the
focused one. (It is small enough because the focused string is rather short in
practice.)

— The right and left context trees, which represent the same information as
the subgraph consisting of the prime substrings that are superstrings of the
focused one. The former represents the information in $E_R$, and the latter in
$E_L$.

The right context tree of a prime substring x is essentially the same as the subtree
of the node x in the suffix tree [3] for S, but augmented by adding to every node
xy with $xy = \overrightarrow{xy}$ ($y \in \Sigma^*$) a label of “$\gamma$[x]y” such that $\gamma xy = \overleftrightarrow{xy}$. (The left context
tree is defined in a similar way.) There may be two nodes $xy_1$ and $xy_2$ such that
$y_1 \ne y_2$ but $\overleftrightarrow{xy_1} = \overleftrightarrow{xy_2}$. Therefore, more than one node may have the same label
if the square brackets ([, ]) are ignored. Figure 5 shows the left and the right context
trees of x = b for S = {babbc, ababb}.




Fig. 5. Left and right context trees of x = b for S = {babbc, ababb}. 



5 Experimental Results 

We carried out experiments for Waka poems and prose texts. 



5.1 Goodness Measures 



Statistical measures such as the classification error, the information entropy,
and the Gini index are abstracted as follows [4]. Let $p_1$ and $n_1$ denote the numbers
of strings in Pos and Neg that contain $\pi$, respectively, and let $p_0 = |Pos| - p_1$
and $n_0 = |Neg| - n_1$. Let

$$f(p_1, n_1, p_0, n_0) = \frac{p_1 + n_1}{N}\,\psi\!\left(\frac{p_1}{p_1 + n_1}\right) + \frac{p_0 + n_0}{N}\,\psi\!\left(\frac{p_0}{p_0 + n_0}\right),$$

where $N = p_1 + n_1 + p_0 + n_0$ and $\psi$ is a non-negative function with the following
properties:

— $\psi(1/2) \ge \psi(r)$ for any $r \in [0, 1]$.

— $\psi(0) = \psi(1) = 0$.

— $\psi(r)$ increases in r on [0, 1/2] and decreases in r on [1/2, 1].

The information entropy measure is the function f such that

$\psi(r) = -r \log r - (1 - r) \log(1 - r)$.

The classification error and the Gini index measures are also obtained by letting
$\psi(r) = \min(r, 1 - r)$ and $\psi(r) = 2r(1 - r)$, respectively.
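As an illustration, the measure f and the three choices of the impurity function can be written in a few lines of Python. This is a sketch under our own assumptions: the natural logarithm is used (the text does not fix the base), and a side of the split with no strings is taken to contribute zero.

```python
import math


def entropy(r):
    """Information entropy impurity; 0 at r = 0 and r = 1 by convention."""
    if r <= 0.0 or r >= 1.0:
        return 0.0
    return -r * math.log(r) - (1 - r) * math.log(1 - r)


def classification_error(r):
    return min(r, 1 - r)


def gini(r):
    return 2 * r * (1 - r)


def goodness(p1, n1, p0, n0, psi=entropy):
    """Weighted impurity of the split induced by a pattern.

    p1, n1: counts in Pos and Neg that contain the pattern;
    p0, n0: counts in Pos and Neg that do not contain it.
    """
    total = p1 + n1 + p0 + n0
    value = 0.0
    for p, n in ((p1, n1), (p0, n0)):
        if p + n > 0:  # assumption: an empty side contributes nothing
            value += (p + n) / total * psi(p / (p + n))
    return value


# A pattern that separates Pos from Neg perfectly has zero impurity,
# while an uninformative pattern keeps the impurity at its maximum.
print(goodness(10, 0, 0, 10), goodness(5, 5, 5, 5))
```

Passing `psi=gini` or `psi=classification_error` switches the measure without changing the rest of the computation.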

In our experiments, we used the information entropy measure. We also tested
the classification error and the Gini index, but no substantial differences were
observed.

5.2 Text Strings We Used

Our experiments were performed on the following classical Japanese literary
works.

(A) Two private anthologies, Sankashu by the priest Saigyō (1118-1190) and
Shugyokushu by the priest Jien (1155-1225). It is well known that Saigyō
was a great influence on Jien. In fact, Jien composed many poems using
expressions similar to those preferred by Saigyō.

(B) Two private anthologies, Shuigusō by Fujiwara no Teika (1162-1241) and
Tameieshu by Fujiwara no Tameie (1198-1275). Teika was a poet and a literary
theorist who ranks among the greatest of Waka poets. The poems of his son
Tameie were influenced by the poems and the theory of Teika.

(C) The Tale of Genji, written by Murasaki Shikibu (Lady Murasaki), which
is considered the first novel ever written. It consists of 54 chapters. Some
scholars are convinced that the Tale of Genji is not all by the same writer.
In particular, it is often claimed that the author of the main chapters and the
author(s) of the Uji chapters (the last 10 chapters) are not the same person.
Our aim is to compare the last 10 chapters with the other 44 chapters.

Table 1 shows the numbers of nondegenerate equivalence classes for the text
strings of (A), (B), and (C), under the three equivalence relations $\equiv_L$, $\equiv_R$,
and $\equiv$. The results imply that the restriction to prime substrings drastically
reduces the number of substrings to be examined by human experts. Thus, the
notion of prime substrings makes text analysis based on substring statistics
realistic.



Table 1. Comparison of the numbers of nondegenerate equivalence classes under the
three equivalence relations $\equiv_L$, $\equiv_R$, and $\equiv$.

                                                                                               # nondegen. equiv. classes
Pos                         Neg                            |S|     ||S||       |Sub(S)|         ≡L         ≡R        ≡
Sankashu (1,552 poems)      Shugyokushu (5,803 poems)    7,355   229,728       2,817,436    259,576    265,238   65,149
Shuigusō (2,985 poems)      Tameieshu (2,101 poems)      5,086   158,290       1,989,446    183,358    185,987   46,288
Tale of Genji (Chap. 1-44)  Tale of Genji (Chap. 45-54)     54   859,796   1,493,709,707  1,182,601  1,181,439  251,343




5.3 Characteristic Expressions Extracted 

It should be noted that a Waka poem consists of five lines, and we can restrict our
attention to the substrings of lines of Waka poems. In order to exclude the substrings which
stretch over two or more lines, we let S be the set of lines of Waka poems in the
anthologies. Then, we have 50,345 and 37,477 prime substrings to be examined
for (A) Sankashu and Shugyokushu, and for (B) Shuigusō and Tameieshu,
respectively.

On the other hand, there is no punctuation in the texts of “the Tale of
Genji” we used. We have 54 text strings, each of which corresponds to a chapter
of the book. For this reason, we define $p_1$ ($n_1$) to be the number of occurrences
of the pattern in Pos (Neg), not the number of strings that contain it.

We created lists of prime substrings ranked by the information entropy
measure. Table 2 shows the best 40 prime substrings for (A) Sankashu and
Shugyokushu, and (B) Shuigusō and Tameieshu. (We omitted the list for “the
Tale of Genji.”) From the upper part of the list, we can notice the following by
using a prototype of our browser.

— While Shugyokushu has Buddhist terms like “Nori-no-michi” (the road of
dharma), “Nori-no-hana” (the flower of dharma), and “Makoto-no-michi”
(the road of truth) in 20, 16, and 18 poems, respectively, there is no such
expression in Sankashu. It should be mentioned that the first two terms were
obtained by browsing the right context tree of the string “no-RI-no,” which
ranks 15th, and the last one was obtained from the left context tree of
the string “no-mi-CHI,” which ranks 24th. See Fig. 6.

— There are 21 examples of “... wo Ikanisemu” (most of them are “Mi wo
Ikanisemu”) in Shugyokushu and there are none in Sankashu. This is interesting
and seems to suggest differences between the two poets in their beliefs in
Buddhism and in their ways of life. The expression “wo Ikanisemu” was obtained
from the left context tree of the string “i-ka-ni-SE,” which ranks 31st.

— In Tameieshu, there are many expressions using “Oi-no-nezame” (wakeful
night for the aged) and “Oi-no-Namida” (tears of the aged), but there are none
in Shuigusō at all. Though Tameie was conscious of old age in his poems,
he did not live longer than his father. (Teika died at 80 and Tameie
at 78.) Surveying Shinpen-Kokkataikan, a collection of 1,162 anthologies of
Waka poems (about 450,000 poems in total), we find that the expressions
“Oi-no-nezame” and “Oi-no-Namida” appear most frequently in Tameieshu.
These expressions, therefore, definitely characterize Tameie’s poetry. In his
last days, he was involved in a family feud about his successor, which
resulted in the dynasty dividing into three, Nijō, Kyōgoku, and Reizei, in
the next generation. This is quite a contrast to the case of Teika, who could
designate Tameie as his sole successor.

— In the narrative or the characters’ speech in the Uji chapters of the Tale of
Genji, we observed some characteristic expressions like “Ikanimo-ikanimo,”
which ranks 90th, and “Shikiwazakana,” which ranks 108th.

But most of the strings collected from the book are proper nouns like
characters’ names, titles, and place names, and they largely depend on the story
settings and the characters. So we have to say that this is far from discovery.
We could have removed such strings under the condition of either $p_1 = 0$ or
$n_1 = 0$, but there was a risk of excluding some other important strings too.
We need to consider how to adapt a filtering method to prose works.

Table 2. Best 40 prime substrings from two pairs of private anthologies. Hyphens
are inserted between syllables, each of which was written as one Kana character
although romanized here.

      (A) Sankashu vs. Shugyokushu               (B) Shuigusō vs. Tameieshu
Rank  G          p1     n1  Substring         G          p1     n1  Substring
   1  0.5115     33    397  RU-NO             0.6668     14    103  O-I
   2  0.5115     26    351  HA-RU-NO          0.6685      2     63  O-I-NO
   3  0.5118     16    271  MI-YO             0.6717      6     54  WO-KU-RA
   4  0.5120    919   2822  TE                0.6723     11     60  WO-KU
   5  0.5122     29    344  NO-SO             0.6731    246     74  SO-RA
   6  0.5124     41     28  KO-KO-CHI         0.6735      4     38  NI-KE-RU
   7  0.5124     44     33  KO-CHI            0.6736    109    168  KO-SO
   8  0.5126     21    273  NO-SO-RA          0.6738      3     34  KU-RA-YA-MA
   9  0.5127     60    495  SO-RA             0.6739     54    106  I-NO
  10  0.5129      7    163  MI-YO-SHI         0.6739      3     33  WO-KU-RA-YA-MA
  11  0.5131      3    120  SU-MI-YO          0.6740      0     23  O-I-NO-NE
  12  0.5131     99    167  MA-SHI            0.6741     69      7  NA-KA-ME
  13  0.5132      7    150  MI-YO-SHI-NO      0.6746      6     35  KU-RA-YA
  14  0.5132      3    114  SU-MI-YO-SHI      0.6747      0     19  O-I-NO-NE-SA-ME
  15  0.5133      7    147  NO-RI-NO          0.6747     88     16  SO-TE-NO
  16  0.5134    487   2281  YO                0.6748    292    115  I-RO
  17  0.5134     53    418  YU-FU             0.6748     57     99  NI-KE
  18  0.5134      8    150  TSU-KA-SE         0.6752     53      6  KI-E
  19  0.5135      0     67  NO-KO-RU          0.6752      9     36  RA-YA-MA
  20  0.5135    117    722  HA-RU             0.6753      2     22  RA-NO-YA-MA
  21  0.5136   1354   5327  NO                0.6753     77    114  NA-MI-TA
  22  0.5136      3    101  SU-MI-YO-SHI-NO   0.6754     53      7  KU-YO
  23  0.5136      2     90  YO-HA-NO          0.6754      2     21  WO-KU-RA-NO
  24  0.5137      0     62  NO-MI-CHI         0.6754    141    173  KA-MI
  25  0.5137     14      3  KA-NA-SHI-KA      0.6755    234     92  SO-TE
  26  0.5138      1     75  KO-RU             0.6755     39      3  I-KU-YO
  27  0.5138      9      0  TSU-TSU-MA        0.6755    134    166  NA-RI
  28  0.5138     55     79  NI-TE             0.6756     42     74  KE-RE
  29  0.5138      6    119  MA-TSU-KA-SE      0.6756     54      8  HO-HI
  30  0.5138      1     73  SU-MI-NO          0.6756      2     20  NO-NE-SA-ME
  31  0.5138      4    102  I-KA-NI-SE        0.6757     91    123  RI-KE
  32  0.5139     37    306  YA-MA-NO          0.6757     47      6  NI-HO-HI
  33  0.5139    152    848  KA-SE             0.6757      0     13  WO-KU-RA-NO-YA-MA-NO
  34  0.5139      2     81  HO-TO-KE          0.6758      4     23  RA-NO-YA
  35  0.5139    129    744  A-KI              0.6758    208    226  NA-SHI
  36  0.5139    330    912  TSU-KI            0.6758      7     28  NE-SA-ME
  37  0.5140     23    223  NA-HO             0.6758      1     16  NA-MI-TA-NA
  38  0.5140     21    211  NO-RI             0.6758      1     16  WO-KU-RA-NO-YA-MA
  39  0.5140     19     11  MI-KE             0.6758    133     44  TE-NO
  40  0.5140     74    131  TE-NI             0.6759   1350   1089  SA




6 Concluding Remarks 



We have reported successful results for Waka poems, but not for prose texts.
We considered all prime substrings as candidates for characteristic expressions.
However, it seems that some filtering process is needed for prose texts. In [13], we
successfully discovered from Waka poems characteristic patterns, called Fushi,
which are regular patterns whose constant parts are restricted to sequences of
auxiliary verbs and postpositional particles. Finding a good filtering method for
prose texts is left for future work.

Fig. 6. Left and right context trees of “no-mi-CHI.” In both trees the top string is
“no-mi-CHI.” In the left context tree (the left one), the 9th and 12th strings from the
top are “no-ri-no-mi-chi” and “ma-ko-to-no-mi-chi,” respectively.



Abstract. In this paper, we propose a method based on the belief decision
tree approach to classify scenarios in an uncertain context.

Our method uses both the decision tree technique and the belief function
theory, as understood in the transferable belief model, in order to find the
classes of the scenarios (of a given problem) that may happen in the
future.

The method has two major phases. The first is the construction of the belief
decision tree representing the scenarios that belong to the training set and
may present some uncertainty in their class membership; this uncertainty
is represented by belief functions. The second is the classification of new
scenarios, which are generally characterized by uncertain configurations of
their hypotheses.



1 Introduction 

A scenario is defined as a set of elementary hypotheses aiming at analyzing future
events and anticipating what may happen. The task of classifying scenarios is
one of the major preoccupations of decision makers: by assigning similar
scenarios to the same class, it helps in finding the best strategic
planning regarding a given problem.

Due to the uncertainty that may occur either in the configurations of a
scenario’s hypotheses or even in the classes of the training scenarios, the
classification task becomes more difficult. Ignoring this uncertainty or
mistreating it may lead to erroneous results.

In this paper, we propose a method based on belief decision trees in order to
classify scenarios in an uncertain context.

The belief decision tree is a classification technique that we have developed
[4], [5]. It is an extension of the standard decision tree approach, based on the
belief function theory in order to cope with the uncertainty related to the
parameters of any classification problem.

The belief decision tree approach offers a suitable framework for dealing with
the classification of scenarios in an uncertain environment. The use of the belief
function theory, as understood in the transferable belief model, allows a better
representation of the uncertainty characterizing the scenarios, especially the
uncertainty expressed by experts.

This paper is organized as follows. We start by presenting the notion of
scenarios and their objectives. Then we give an overview of what we call a belief
decision tree: we first introduce the basics of the belief function theory
and those of the decision tree technique, and next we describe the construction
and classification procedures of the belief decision tree approach.
Finally, we detail our method for classifying scenarios using belief decision trees.
An example illustrating our proposed method will be presented.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 127-140, 2000.

(c) Springer-Verlag Berlin Heidelberg 2000



2 Scenarios 

A scenario describes a future situation. It allows us to elicit and anticipate
what may happen. It is composed of a set of hypotheses related to the
components of the given problem (field). Each hypothesis can take one or
several configurations. Hence, a scenario is considered as a combination of these
configurations.

These hypotheses and their configurations are defined through interviews and
questions addressed to experts and actors involved in the given problem.
Several methods for this purpose are proposed in the literature; the most
widely used is the Delphi technique [6].

Two types of scenarios are defined:

— Exploratory scenarios, which are based on past and present trends in order to
elicit the future. They are equivalent to classical forecasting.

— Anticipatory scenarios, which are built on the basis of different visions of the
future, desired or feared. Here we have to fix the future objectives and try to
find how to achieve them.

The scenario method plays an important role, especially in decision problems,
given its capability to deal with various fields and consequently to help decision
makers find the appropriate strategic planning.

The scenario method is mainly based on experts’ opinions [3], [6], [7], [9]. It
includes different steps. The major ones dealing with scenarios are their
assessment and their classification.

The latter step, the classification of scenarios, groups scenarios
sharing similar characteristics in the same class. By taking into account
the class of a scenario, more reliable decisions may be taken.



3 Belief Decision Trees 

In this section, we briefly review the basics of the belief function theory as
understood in the Transferable Belief Model (TBM) [15], [16], [17] and those of
the decision tree technique [10], [11], [12]. Then, we describe our belief decision
tree approach based on both the belief function theory and the decision tree
method.

3.1 Belief Function Theory

Definitions. Let $\Theta$ be a set of mutually exclusive and exhaustive events referred
to as the frame of discernment. A basic belief assignment (bba) (initially called
basic probability assignment [13]) over $\Theta$ is a function m defined as follows:

$m: 2^{\Theta} \to [0, 1]$ such that:

$\sum_{A \subseteq \Theta} m(A) = 1$

For each subset A of the frame $\Theta$, m(A) measures the part of belief
that is committed exactly to A and which cannot be apportioned to any strict
subset of A.

We call a subset $A \subseteq \Theta$ such that m(A) > 0 a focal element. A basic belief
assignment having only $\Theta$ as a focal element is called a vacuous belief function;
it represents total ignorance [13].

The belief function bel, corresponding to a basic belief assignment m,
represents the total belief committed to each subset A of the frame of discernment
$\Theta$: $bel(A) = \sum_{\emptyset \ne B \subseteq A} m(B)$ and $bel(\emptyset) = 0$.

Note that assessments of the bba are explained in [15] and [18].

Combination. Consider two distinct pieces of evidence on the same frame $\Theta$
represented respectively by two bba’s $m_1$ and $m_2$. At least two kinds of
combination may be defined [15]:

— The Conjunctive Rule: providing a new bba that represents the combined
impact of the two pieces of evidence. So we get:

$(m_1 \wedge m_2)(A) = \sum_{B, C \subseteq \Theta:\, B \cap C = A} m_1(B)\, m_2(C)$ for $A \subseteq \Theta$

The conjunctive rule can be seen as an unnormalized Dempster’s rule of
combination. Dempster’s rule is defined as [13]:

$(m_1 \oplus m_2)(A) = K \sum_{B, C \subseteq \Theta:\, B \cap C = A} m_1(B)\, m_2(C)$

where

$K^{-1} = 1 - \sum_{B, C \subseteq \Theta:\, B \cap C = \emptyset} m_1(B)\, m_2(C)$
and $(m_1 \oplus m_2)(\emptyset) = 0$.
K is called the normalization factor.

— The Disjunctive Rule: inducing a bba that expresses the case when we only
know that at least one of the two pieces of evidence actually holds but we
do not know which one. So we get:

$(m_1 \vee m_2)(A) = \sum_{B, C \subseteq \Theta:\, B \cup C = A} m_1(B)\, m_2(C)$ for $A \subseteq \Theta$

Note that since the conjunctive and the disjunctive rules are both commutative
and associative, combining several pieces of evidence induced from distinct
information sources (either conjunctively or disjunctively) may be easily done
by applying the chosen rule repeatedly.
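The three rules differ only in the set operation applied to the focal elements and in whether the result is renormalized, which the following Python sketch makes explicit. The representation is our own assumption: a bba is a dictionary mapping `frozenset` focal elements to masses, and the function names are hypothetical.

```python
from itertools import product


def combine(m1, m2, op):
    """Sum m1(B) * m2(C) over all pairs whose combination op(B, C) equals A."""
    out = {}
    for (B, b), (C, c) in product(m1.items(), m2.items()):
        A = op(B, C)
        out[A] = out.get(A, 0.0) + b * c
    return out


def conjunctive(m1, m2):
    return combine(m1, m2, frozenset.intersection)


def disjunctive(m1, m2):
    return combine(m1, m2, frozenset.union)


def dempster(m1, m2):
    """Conjunctive rule followed by normalization of the conflicting mass."""
    m = conjunctive(m1, m2)
    conflict = m.pop(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be normalized")
    K = 1.0 / (1.0 - conflict)
    return {A: K * mass for A, mass in m.items()}


# Illustrative frame {a, b, c} with two simple bba's (made-up numbers).
m1 = {frozenset("a"): 0.6, frozenset("abc"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("abc"): 0.5}
print(conjunctive(m1, m2))
print(dempster(m1, m2))
print(disjunctive(m1, m2))
```

In this example the conjunctive result keeps the conflicting mass on the empty set, whereas Dempster's rule redistributes it over the remaining focal elements.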







Zied Elouedi and Khaled Mellouli 



Vacuous Extension of Belief Functions. Let X and Y be two sets of
variables such that $Y \subseteq X$. Let $m^Y$ be a bba defined on the domain of Y, $\Theta_Y$,
which is the cross product of the different variables of Y. The extension of $m^Y$
to $\Theta_X$, denoted $m^{Y \uparrow X}$, means that the information in $m^Y$ is extended to a larger
frame X [8]:

$m^{Y \uparrow X}(A \times \Theta_{X-Y}) = m^Y(A)$ for $A \subseteq \Theta_Y$
$m^{Y \uparrow X}(B) = 0$ if B is not of the form $A \times \Theta_{X-Y}$
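A small sketch may help to see what the extension does to the focal elements. We assume, for illustration only, that a configuration is a tuple of values ordered like the variable list, that a bba maps `frozenset`s of configurations to masses, and that `domains` gives the finite domain of each variable; all names are hypothetical.

```python
from itertools import product


def vacuous_extension(m_y, vars_y, vars_x, domains):
    """Extend a bba on the variables vars_y to the larger variable set vars_x.

    Each focal element A is replaced by the cylinder A x Theta_{X-Y};
    its mass is unchanged, and no other subset receives mass.
    """
    extra = [v for v in vars_x if v not in vars_y]
    extra_values = list(product(*(domains[v] for v in extra)))
    m_x = {}
    for focal, mass in m_y.items():
        cylinder = set()
        for config in focal:
            known = dict(zip(vars_y, config))
            for values in extra_values:
                full = {**known, **dict(zip(extra, values))}
                cylinder.add(tuple(full[v] for v in vars_x))
        m_x[frozenset(cylinder)] = mass
    return m_x


# Two illustrative hypotheses H1 and H2 with made-up configurations.
domains = {"H1": ("low", "high"), "H2": ("yes", "no")}
m_y = {frozenset({("low",)}): 0.7,
       frozenset({("low",), ("high",)}): 0.3}
print(vacuous_extension(m_y, ("H1",), ("H1", "H2"), domains))
```

The mass 0.7 on "H1 is low" moves to the set of all pairs whose first component is "low", so nothing is asserted about H2.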

Decision Process. The Transferable Belief Model (TBM) developed by Smets
presents a solution for making decisions within the belief function framework. In
fact, the TBM is based on two levels:

— a credal level where beliefs are entertained and quantified by using belief
functions.

— a pignistic level where beliefs are used to make decisions and where they are
represented by probability functions called the pignistic probabilities.

When a decision is needed, we use the pignistic transformation, which builds a
pignistic probability function BetP from the initial bba m (handled at the credal
level) as follows [16]:

$BetP(\theta) = \sum_{A \subseteq \Theta,\ \theta \in A} \frac{m(A)}{|A|\,(1 - m(\emptyset))}$, for all $\theta \in \Theta$

Note that $m(\emptyset)$ is interpreted as the part of belief given to the fact that none
of the hypotheses in $\Theta$ is true or as the amount of conflict between the pieces of
evidence. In the case of a normalized context the value of $m(\emptyset)$ is equal to zero.
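The pignistic transformation spreads the mass of each focal element evenly over its members. A minimal Python sketch, assuming as before that a bba is a dictionary from `frozenset` focal elements to masses (our own representation):

```python
def pignistic(m):
    """Return BetP as a dictionary from elements of the frame to probabilities."""
    empty_mass = m.get(frozenset(), 0.0)
    if empty_mass >= 1.0:
        raise ValueError("all mass is on the empty set")
    betp = {}
    for focal, mass in m.items():
        if not focal:
            continue  # the empty set only enters through the factor 1 - m(empty)
        share = mass / (len(focal) * (1.0 - empty_mass))
        for theta in focal:
            betp[theta] = betp.get(theta, 0.0) + share
    return betp


# Illustrative bba on the frame {a, b, c} (made-up numbers).
m = {frozenset("a"): 0.5, frozenset("ab"): 0.3, frozenset("abc"): 0.2}
print(pignistic(m))
```

The resulting values sum to one, so they can be used directly to pick the most probable hypothesis.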



3.2 Decision Trees 

Several learning methods have been developed for classification. Among
these techniques, the decision tree may be one of the most commonly used.

Decision trees are characterized by their capability to break down a complex
decision problem into several simpler decisions. They represent a sequential
procedure for deciding the membership class of a given instance. Their major
advantage lies in providing a powerful formalism for expressing classification
knowledge [10] and in providing comprehensible classifiers.

The decision tree technique is composed of two major procedures:

1. The first for building the tree: Based on a given training set, a decision tree
can be built. It consists in finding for each decision node the “appropriate”
test attribute by using an attribute selection measure, and in defining the
class labeling each leaf. As a result, we get a decision tree where decision
nodes represent attributes, branches correspond to the possible attribute
values, and leaves contain sets of instances belonging to the same class.

2. The second for the classification: Once the tree is constructed, in order
to classify a new instance we start at the root of the decision tree and
test the attribute specified by this node. According to the result of the test,
we move down the tree branch corresponding to the attribute value of the given
instance. This process is repeated until a leaf is encountered. This leaf
is labeled by a class.
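The classification procedure of a standard decision tree amounts to a walk from the root to a leaf. The sketch below assumes a simple nested-dictionary encoding of the tree, which is our own choice; the attribute names and classes are invented for illustration.

```python
def classify(tree, instance):
    """Follow the branches selected by the instance's attribute values.

    A decision node is a dict with keys "attribute" and "branches";
    anything else is treated as a leaf and returned as the class label.
    """
    node = tree
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


# Hypothetical tree over two made-up attributes.
tree = {
    "attribute": "market",
    "branches": {
        "growing": "C1",
        "stable": {
            "attribute": "regulation",
            "branches": {"strict": "C2", "loose": "C1"},
        },
    },
}
print(classify(tree, {"market": "stable", "regulation": "strict"}))
```

An attribute value with no matching branch raises a `KeyError` here; a practical implementation would need a policy for such unseen values.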

3.3 Belief Decision Tree 

Due to the uncertainty that may occur in the parameters of any classification
problem, we have developed what we call a belief decision tree [4], [5]. This
approach provides a classification method in an uncertain context based on both
the decision tree technique and the belief function theory as explained in the
TBM.

The use of the belief function theory in decision trees provides a suitable
framework thanks to its ability to treat subjective, personal judgments on the
different parameters (attributes, classes) of any classification problem. It makes
it possible to represent beliefs not only on elementary hypotheses but also on a
collection (disjunction) of hypotheses.

Besides, this theory allows experts to express partial beliefs in a much more
flexible way than probability functions do. It can also handle partial or
even total ignorance concerning classification parameters.

Furthermore, it offers appropriate tools to combine several pieces of evidence,
such as the conjunctive and the disjunctive rules. Decision making is ensured by
applying the pignistic transformation.

In addition to these advantages, it is easily applied in reasoning-based systems
such as expert systems and decision support systems.

Construction Procedure. As described for a standard decision tree, the con- 
struction of a belief decision tree is mainly based on a training set of instances. 
Since we deal with uncertainty, the structure of this set will change from the 
traditional one. 

In fact, we assume that the uncertainty will occur only in the classes of 
training instances. Such uncertainty is generally due to lack of information. 

Since we use the belief function theory, for each training object’s class, we 
define a basic belief assignment showing beliefs given by experts on the different 
classes to which this object may belong. 

Due to the uncertainty in the training set, a leaf is not attached to a unique
class; instead, it is labeled by a bba. This bba represents the beliefs on the
possible classes associated with the path from the root to this leaf.

Once the structure of the training set is defined, we present our algorithm
for constructing a belief decision tree, which is an extension of the ID3 algorithm
[10] within the belief function framework.

Let T be the training set, A be the set of attributes and $m^{\Theta}\{I_j\}$ be the bba
defined on $\Theta$, the set of the n possible classes, representing the beliefs given
by the experts on the actual class of the training instance $I_j$.

Our algorithm is described as follows:



1. Generate the root node containing all the training instances in T. 

2. If the treated node satisfies one of the following conditions (known as the
stopping criterion):

- It contains only one object.

- There is no attribute left to test.

- The information gain (defined below) of each remaining attribute is less
than or equal to zero.

Then the node is declared a leaf whose bba $m^{\Theta}\{L\}$ is equal to the average
bba of the objects belonging to this leaf:

$m^{\Theta}\{L\} = \frac{1}{|L|} \sum_{I_j \in L} m^{\Theta}\{I_j\}$    (1)

where |L| represents the number of instances belonging to the leaf L.

3. If the node does not satisfy the stopping criterion, for each attribute $A_i \in A$,
compute the information gain $Gain(T, A_i)$; the attribute presenting
the maximum information gain is then selected as the test attribute and is
considered as the root of the current tree.

4. According to the selected attribute values, apply the partitioning strategy
that divides the training set T into training subsets $T_1, T_2, \ldots$ Each
subset contains the objects having one value of the selected attribute.

5. Repeat the same process for each training subset while verifying the stopping
criterion for the remaining attributes.

6. Stop when all the nodes of the last level of the tree are leaves.
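The leaf-labeling rule of equation (1) can be sketched as follows. This is only an illustrative sketch: representing a bba as a Python dict from frozensets of class names to masses is our assumption, and the function name `average_bba` is hypothetical. The two bba's are those of the scenarios S1 and S5 used later in Example 1.

```python
def average_bba(bbas):
    """Average several bba's (equation (1)).

    Each bba is assumed to be a dict mapping a frozenset of class
    names (a focal element) to its basic belief mass.
    """
    total = {}
    for m in bbas:
        for focal, mass in m.items():
            total[focal] = total.get(focal, 0.0) + mass
    return {focal: mass / len(bbas) for focal, mass in total.items()}


# Frame of discernment and two bba's taken from Example 1 (S1 and S5).
THETA = frozenset({"C1", "C2", "C3"})
m_s1 = {frozenset({"C1"}): 0.8, THETA: 0.2}
m_s5 = {frozenset({"C1"}): 0.9, THETA: 0.1}

leaf_bba = average_bba([m_s1, m_s5])
for focal, mass in leaf_bba.items():
    print(sorted(focal), round(mass, 2))
```

The averaged masses still sum to one, so the result is again a bba.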



The information gain $Gain(T, A_i)$ of the attribute $A_i$ has to take into account
the uncertainty lying on the classes. It is defined as follows:

$Gain(T, A_i) = Info(T) - Info_{A_i}(T)$    (2)

where

$Info(T) = -\sum_{i=1}^{n} BetP^{\Theta}\{T\}(C_i) \cdot \log_2 BetP^{\Theta}\{T\}(C_i)$    (3)

and

$Info_{A_i}(T) = \sum_{v \in values(A_i)} \frac{|T_v^{A_i}|}{|T|} Info(T_v^{A_i})$    (4)

where $BetP^{\Theta}\{T\}$ is the average pignistic probability function taken over the
training set T and defined by:

$BetP^{\Theta}\{T\}(C) = \frac{1}{|T|} \sum_{I_j \in T} BetP^{\Theta}\{I_j\}(C)$    (5)

and $T_v^{A_i}$ is the training subset when the value of the attribute $A_i$ is equal to v.

We have to mention that the attribute selection measure used in our algorithm
is the extension of Quinlan's information gain to an uncertain
context by using the basics of the belief function theory. Our attribute selection
measure handles uncertainty, represented by basic belief assignment
functions, in the classes of training instances. As explained, handling beliefs instead
of probabilities in the training instances is more flexible and presents a more
generalized context for dealing with uncertainty.
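A minimal sketch of equations (2) to (5) is given below. It assumes that each training instance is a dict holding its attribute values and a `"bba"` entry (a dict from frozensets of classes to masses), and that no mass is given to the empty set; these representation choices and all function names are ours, for illustration only.

```python
import math


def pignistic(m):
    """BetP(c): share each focal mass equally among its elements.

    Assumes m gives no mass to the empty set.
    """
    betp = {}
    for focal, mass in m.items():
        for c in focal:
            betp[c] = betp.get(c, 0.0) + mass / len(focal)
    return betp


def average_betp(instances, classes):
    """Equation (5): average pignistic probability over a set of instances."""
    avg = {c: 0.0 for c in classes}
    for inst in instances:
        betp = pignistic(inst["bba"])
        for c in classes:
            avg[c] += betp.get(c, 0.0) / len(instances)
    return avg


def info(instances, classes):
    """Equation (3): entropy of the average pignistic probability."""
    avg = average_betp(instances, classes)
    return -sum(p * math.log2(p) for p in avg.values() if p > 0)


def gain(instances, attribute, classes):
    """Equations (2) and (4): information gain of one attribute."""
    subsets = {}
    for inst in instances:
        subsets.setdefault(inst[attribute], []).append(inst)
    remainder = sum(len(sub) / len(instances) * info(sub, classes)
                    for sub in subsets.values())
    return info(instances, classes) - remainder


# Toy data (hypothetical): the attribute separates two certain classes.
toy = [
    {"wind": "strong", "bba": {frozenset({"C1"}): 1.0}},
    {"wind": "weak", "bba": {frozenset({"C2"}): 1.0}},
]
toy_gain = gain(toy, "wind", ["C1", "C2"])
partial = pignistic({frozenset({"C1"}): 0.8,
                     frozenset({"C1", "C2", "C3"}): 0.2})
print(toy_gain, partial)
```

When every bba is certain (a single singleton focal element with mass 1), these functions reduce to the classical information gain.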



Classification Procedure. The classification procedure relies on
the induced belief decision tree.

As we deal with an uncertain environment, the new instances to classify may
be characterized by missing or uncertain attribute values.

In fact, this uncertainty can be represented by a basic belief assignment on
the set of all the possible values of each attribute. The use of bba's
also covers particular cases: instances with some certain attribute values,
instances with disjunctive values in some of their attributes, and instances
with missing values in some of their attributes.

Let $m^{A_i}$ be the bba representing the part of beliefs committed exactly to the
different values of the attribute $A_i$ of the new instance to classify. This
bba is defined on the frame of discernment $\Theta_{A_i}$, which includes all the possible
values of the attribute $A_i$.

Let $\Theta_A$ be the global frame of discernment relative to all the attributes. It is
equal to the cross product of the different $\Theta_{A_i}$: $\Theta_A = \times_{i=1..k} \Theta_{A_i}$.

Since a given instance is described by a combination of values, where
each one is relative to an attribute, we have to find the bba expressing beliefs
on the different attribute values of the new instance to classify. In other words,
we have to look for the joint bba $m^{\Theta_A}$ representing beliefs on all the instance's
attributes by applying the conjunctive rule of combination:

$m^{\Theta_A} = \bigwedge_{i=1..k} m^{A_i \uparrow \Theta_A}$    (6)

where $m^{A_i \uparrow \Theta_A}$ is the extension of the bba $m^{A_i}$ to the frame $\Theta_A$.

Then, for each focal element x of $m^{\Theta_A}$, we have to compute the belief
function $bel^{\Theta}[x]$ defined on the set of the possible classes $\Theta$ given x. This belief
function is equal to the result of the disjunctive rule of combination applied to
the belief functions of the leaves corresponding to this focal element x.

Once computed, these belief functions are averaged using $m^{\Theta_A}$ such that:

$bel^{\Theta}[m^{\Theta_A}](\theta) = \sum_{x \subseteq \Theta_A} m^{\Theta_A}(x) \cdot bel^{\Theta}[x](\theta)$ for $\theta \subseteq \Theta$    (7)

Hence, $bel^{\Theta}[m^{\Theta_A}]$ represents the total beliefs that this new instance belongs to the
different classes related to the problem. To know the part of belief exactly
committed to the different classes of $\Theta$, we may easily induce $m^{\Theta}[m^{\Theta_A}]$.

Finally, we apply the pignistic transformation to this bba in order to get the
probability that this instance belongs to each singular class.

4 Classifying Scenarios using Belief Decision Trees

In this section, we present the different steps leading to the classification of 
scenarios using a belief decision tree. Each step will be illustrated by an example 
explaining its real unfolding. 



4.1 Scenarios vs Training Set 

A scenario is seen as a combination of hypothesis configurations, where each one
is relative to a hypothesis $H_i$.

To represent scenarios within a belief decision tree, we consider the hypotheses
as being the attributes, whereas the configurations corresponding to each
hypothesis are treated as the attribute values.

Hence, we get a training set of scenarios characterized by certain hypothesis
configurations; however, there may be some uncertainty in their classes, defined
for each scenario by a bba on the set of classes. These bba's are given by experts.



Example 1 Let's consider a simple example presenting scenarios regarding the
agriculture field. For simplicity's sake, we define only three elementary hypotheses
composing these scenarios:

- H1: the rainfall, which can be high or weak.

- H2: the temperature, which can be hot, mild or cold.

- H3: the wind, with values strong or weak.

A possible scenario is, for example, a high rainfall with hot temperature
and weak wind. In fact, there are twelve possible scenarios.

There are three classes to which the scenarios related to this problem may
belong:

- C1: regrouping the favorable scenarios for the agriculture field.

- C2: regrouping the neutral scenarios for the agriculture field.

- C3: regrouping the disastrous scenarios for the agriculture field.

We have an expert's beliefs about the classes of the scenarios that have occurred
over the last six years; we get the following results:



Table 1. Training set T

S_i   Rainfall   Temperature   Wind     bba's
S1    High       Mild          Weak     $m^{\Theta}\{S_1\}$
S2    High       Cold          Strong   $m^{\Theta}\{S_2\}$
S3    High       Hot           Weak     $m^{\Theta}\{S_3\}$
S4    Weak       Hot           Weak     $m^{\Theta}\{S_4\}$
S5    High       Mild          Weak     $m^{\Theta}\{S_5\}$
S6    Weak       Hot           Strong   $m^{\Theta}\{S_6\}$

where

$m^{\Theta}\{S_1\}(C_1) = 0.8$; $m^{\Theta}\{S_1\}(\Theta) = 0.2$;
$m^{\Theta}\{S_2\}(C_1 \cup C_2) = 1$;
$m^{\Theta}\{S_3\}(C_2) = 0.2$; $m^{\Theta}\{S_3\}(C_3) = 0.4$; $m^{\Theta}\{S_3\}(C_1 \cup C_2) = 0.2$; $m^{\Theta}\{S_3\}(\Theta) = 0.2$;
$m^{\Theta}\{S_4\}(C_3) = 0.6$; $m^{\Theta}\{S_4\}(C_2 \cup C_3) = 0.2$; $m^{\Theta}\{S_4\}(\Theta) = 0.2$;
$m^{\Theta}\{S_5\}(C_1) = 0.9$; $m^{\Theta}\{S_5\}(\Theta) = 0.1$;
$m^{\Theta}\{S_6\}(C_3) = 1$;

As we noted, there is some uncertainty concerning the classes of these training
scenarios.

For example, 0.8 is the part of belief committed to the scenario S1 belonging
to the class of scenarios favorable for agriculture, whereas 0.2 is the part of
belief committed to the whole frame $\Theta$.

However, for the scenario S6, the expert is sure that S6 is disastrous, whereas
for the scenario S2, he is certain that it is either favorable or neutral for
agriculture and not disastrous.

This training set will allow us to build the corresponding belief decision tree,
which represents what is learned from these six training scenarios.

4.2 Construction Procedure using Scenarios

In order to construct a belief decision tree based on scenarios, we have to adapt
the parameters used in the construction algorithm of a belief decision tree to the
case of handling scenarios instead of ordinary objects.

Therefore, the idea is to build a tree taking into account the scenarios belonging
to the training set, which are characterized by uncertain classes. In fact, the belief
decision tree relative to the training scenarios will be built by employing a
recursive divide and conquer strategy (as described in subsection 3.3). Its steps
can be summarized as follows:

— By using the information gain measure (extended to the uncertain context), 
a hypothesis (the one having the highest gain) will be chosen in order to 
partition the training set of scenarios. Therefore, the chosen hypothesis is 
selected as the root node of the current tree. 

— Based on a partitioning strategy, the current training set will be divided into 
training subsets by taking into account the configurations of the selected 
hypothesis. 

— When the stopping criterion is satisfied, the training subset will be declared 
as a leaf. 

Once the tree is built, it can be used to classify new scenarios.

Example 2 Let's continue with Example 1. In order to find the root of
the belief decision tree relative to the training set T, we have to compute the
information gain of each hypothesis.

We start by computing Info(T): 







Zied Elouedi and Khaled Mellouli 



$Info(T) = -\sum_{i=1}^{3} BetP^{\Theta}\{T\}(C_i) \cdot \log_2 BetP^{\Theta}\{T\}(C_i)$

To compute the average pignistic probability $BetP^{\Theta}\{T\}$, we first have to
calculate the different $BetP^{\Theta}\{S_i\}$, where $i \in \{1, 2, \ldots, 6\}$ (see Table 2):



Table 2. Average BetP's

S_i   $BetP^{\Theta}\{S_i\}(C_1)$   $BetP^{\Theta}\{S_i\}(C_2)$   $BetP^{\Theta}\{S_i\}(C_3)$
S1    0.86                 0.07                 0.07
S2    0.5                  0.5                  0
S3    0.17                 0.37                 0.46
S4    0.07                 0.17                 0.76
S5    0.94                 0.03                 0.03
S6    0                    0                    1



$BetP^{\Theta}\{T\}$, the average pignistic probability taken over the whole training
set T, is obtained by applying equation (5):

$BetP^{\Theta}\{T\}(C_1) = \frac{1}{6} (0.86 + 0.5 + 0.17 + 0.07 + 0.94 + 0) = 0.42$;

$BetP^{\Theta}\{T\}(C_2) = 0.19$;

$BetP^{\Theta}\{T\}(C_3) = 0.39$;

Hence Info(T) = 1.511;
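As a check on this step, the averages and Info(T) can be recomputed directly from the rounded values of Table 2. The short sketch below only assumes the table values; the variable names are ours. Because it starts from rounded entries, it gives Info(T) of about 1.51, in line with the value 1.511 reported above.

```python
import math

# Pignistic probabilities (C1, C2, C3) of the six scenarios, from Table 2.
table2 = {
    "S1": (0.86, 0.07, 0.07),
    "S2": (0.5, 0.5, 0.0),
    "S3": (0.17, 0.37, 0.46),
    "S4": (0.07, 0.17, 0.76),
    "S5": (0.94, 0.03, 0.03),
    "S6": (0.0, 0.0, 1.0),
}

# Equation (5): average each column over the training set.
avg = [sum(row[k] for row in table2.values()) / len(table2) for k in range(3)]

# Equation (3): entropy of the average pignistic probability.
info_T = -sum(p * math.log2(p) for p in avg if p > 0)

print([round(p, 2) for p in avg], round(info_T, 2))
```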

Then, we have to compute the information gain relative to each hypothesis
related to the agriculture field. We get:

Gain(T, rainfall) = 1.511 - 1.056 = 0.455;

Gain(T, temperature) = 1.511 - 0.885 = 0.626;

Gain(T, wind) = 1.511 - 1.464 = 0.047;

The rainfall hypothesis presents the highest information gain. Thus, it will 
be chosen as the root relative to the training set T. 

So, we get the following decision tree (see Fig 1): 

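[Figure: root node Rainfall on the training set T, with two branches: high, leading to the subset $T^{rainfall}_{high}$ = {S1, S2, S3, S5}, and weak, leading to the subset $T^{rainfall}_{weak}$ = {S4, S6}.]

Fig. 1. Belief decision tree (first step).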

Now, we have to apply the same process to the two training subsets: $T^{rainfall}_{high}$,
including the scenarios characterized by high rainfall, and $T^{rainfall}_{weak}$, including
those characterized by weak rainfall.

This process halts when the stopping criterion is fulfilled for all the
training subsets.

For the problem related to the agriculture field, the final belief decision tree 
is as follows (see Fig 2): 



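[Figure: tree rooted at Rainfall, with a Temperature test node below it; the leaves are labeled by bba's such as $m^{\Theta}\{S_2\}$ and $m^{\Theta}\{S_3\}$.]

Fig. 2. Belief decision tree representing training scenarios,

where $m^{\Theta}\{S_{15}\}$ is the average bba of the subset including the scenarios S1
and S5. So, we get: $m^{\Theta}\{S_{15}\}(C_1) = 0.85$; $m^{\Theta}\{S_{15}\}(\Theta) = 0.15$;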

Classification of New Scenarios. This phase is very important since it
allows us to classify scenarios characterized by uncertain hypothesis configurations.
This classification is ensured by taking into account the constructed belief
decision tree.

In fact, training scenarios (and their classes) represented by means of a
belief decision tree constitute a convenient framework for classifying new scenarios.
Our classification method handles not only uncertain hypothesis
configurations (described by basic belief assignments) but also hypotheses
with certain, disjunctive or even unknown configurations.

The classification of a new scenario that may happen helps
decision makers to fix the appropriate strategic planning according to
the beliefs assigned to the classes to which this scenario may belong.

Example 3 Let's continue with Example 2. We would like to classify the
scenario S, relating to the agriculture field, that may occur in the year 2002.

Let's define: H = {rainfall, temperature, wind}

$\Theta^{rainfall}$ = {high, weak}

$\Theta^{temperature}$ = {hot, mild, cold}

$\Theta^{wind}$ = {strong, weak}

and $\Theta_H = \Theta^{rainfall} \times \Theta^{temperature} \times \Theta^{wind}$

The expert is not sure about the "future" configurations of some of the three
elementary hypotheses related to the scenario S (to classify). He presents his
opinions as follows:

$m^{rainfall}(\{high\}) = 0.6$; $m^{rainfall}(\Theta^{rainfall}) = 0.4$;

$m^{temperature}(\{mild\}) = 1$;

$m^{wind}(\Theta^{wind}) = 1$;

In other words, the scenario S to classify is characterized by some uncertainty
regarding the rainfall hypothesis, a mild temperature and total ignorance
concerning the wind hypothesis (represented by a vacuous bba).

So what will be the class of this "uncertain" scenario S?



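We start by extending the different hypotheses' bba's to $\Theta_H$. We get:

$m^{rainfall \uparrow \Theta_H}(\{high\} \times \Theta^{temperature} \times \Theta^{wind}) = 0.6$;

$m^{rainfall \uparrow \Theta_H}(\Theta^{rainfall} \times \Theta^{temperature} \times \Theta^{wind}) = 0.4$;

$m^{temperature \uparrow \Theta_H}(\Theta^{rainfall} \times \{mild\} \times \Theta^{wind}) = 1$;

$m^{wind \uparrow \Theta_H}(\Theta^{rainfall} \times \Theta^{temperature} \times \Theta^{wind}) = 1$;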



In order to get the beliefs committed to the possible scenarios that may
happen in the year 2002, we have to combine these extended bba's by using the
conjunctive rule of combination. We get:

$m^{\Theta_H} = m^{rainfall \uparrow \Theta_H} \wedge m^{temperature \uparrow \Theta_H} \wedge m^{wind \uparrow \Theta_H}$ such that:

$m^{\Theta_H}$({(high, mild, strong), (high, mild, weak)}) = 0.6;

$m^{\Theta_H}$({(high, mild, strong), (high, mild, weak), (weak, mild, strong),
(weak, mild, weak)}) = 0.4;

We note that there are two focal elements with basic belief masses equal
respectively to 0.6 and 0.4.
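The extension and the conjunctive combination used in this step can be sketched as follows. The representation is our assumption: a subset of the product frame is a frozenset of (rainfall, temperature, wind) triples, and the helper names `extend` and `conjunctive` are hypothetical. The rule implemented is the unnormalized conjunctive rule, where the mass of an intersection is the sum of the products of the masses of the sets that produce it.

```python
from itertools import product

RAIN = ("high", "weak")
TEMP = ("hot", "mild", "cold")
WIND = ("strong", "weak")


def extend(rain_vals, temp_vals, wind_vals):
    """Subset of the product frame, as a frozenset of triples."""
    return frozenset(product(rain_vals, temp_vals, wind_vals))


def conjunctive(m1, m2):
    """Unnormalized conjunctive rule: m(A) = sum of m1(B) * m2(C) with B & C = A."""
    out = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            out[a] = out.get(a, 0.0) + mass_b * mass_c
    return out


# Extended bba's of the expert's opinions on rainfall, temperature and wind.
m_rain = {extend(("high",), TEMP, WIND): 0.6, extend(RAIN, TEMP, WIND): 0.4}
m_temp = {extend(RAIN, ("mild",), WIND): 1.0}
m_wind = {extend(RAIN, TEMP, WIND): 1.0}

joint = conjunctive(conjunctive(m_rain, m_temp), m_wind)
for focal, mass in joint.items():
    print(sorted(focal), mass)
```

The two focal elements printed are the ones listed above, with masses 0.6 and 0.4.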



Then, we have to find beliefs on classes (defined on $\Theta$) given the configurations
of the hypotheses characterizing the new scenario S to classify. These
beliefs have to take into account the two focal elements. According to the belief
decision tree induced in Example 2 (see Fig. 2), we get:

$bel^{\Theta}$[{(high, mild, strong), (high, mild, weak)}] = $bel^{\Theta}\{S_{15}\}$;

$bel^{\Theta}$[{(high, mild, strong), (high, mild, weak), (weak, mild, strong),
(weak, mild, weak)}] = $bel^{\Theta}\{S_{15}\} \vee bel^{\Theta}\{S_4\} \vee bel^{\Theta}\{S_6\}$

Let $bel_1 = bel^{\Theta}\{S_{15}\}$ and $bel_2 = bel^{\Theta}\{S_{15}\} \vee bel^{\Theta}\{S_4\} \vee bel^{\Theta}\{S_6\}$.

The values (see Table 3) of $bel_1$ are induced from the bba $m^{\Theta}\{S_{15}\}$, whereas
those of $bel_2$ are computed from the combination of $bel^{\Theta}\{S_{15}\}$, $bel^{\Theta}\{S_4\}$ and
$bel^{\Theta}\{S_6\}$ using the disjunctive rule.



Table 3. Beliefs on classes given the hypotheses' configurations

         ∅    C1     C2   C3   C1∪C2   C1∪C3   C2∪C3   Θ
bel_1    0    0.85   0    0    0.85    0.85    0       1
bel_2    0    0      0    0    0       0.51    0       1

Hence, these two belief functions are averaged (using the values of the
bba $m^{\Theta_H}$). We get:

$bel^{\Theta}[m^{\Theta_H}](\emptyset) = 0$;

$bel^{\Theta}[m^{\Theta_H}](C_1) = 0.6 * 0.85 + 0.4 * 0 = 0.51$;

$bel^{\Theta}[m^{\Theta_H}](C_2) = 0$;

$bel^{\Theta}[m^{\Theta_H}](C_3) = 0$;

$bel^{\Theta}[m^{\Theta_H}](C_1 \cup C_2) = 0.6 * 0.85 = 0.51$;

$bel^{\Theta}[m^{\Theta_H}](C_1 \cup C_3) = 0.6 * 0.85 + 0.4 * 0.51 = 0.71$;

$bel^{\Theta}[m^{\Theta_H}](C_2 \cup C_3) = 0$;

$bel^{\Theta}[m^{\Theta_H}](\Theta) = 1$

Applying the pignistic transformation gives us the probability of each singular
class. The pignistic probability is as follows:

$BetP^{\Theta}(C_1) = 0.71$; $BetP^{\Theta}(C_2) = 0.09$; $BetP^{\Theta}(C_3) = 0.2$;

Hence, the probabilities that the scenario S belongs to the classes
C1, C2 and C3 are respectively 0.71, 0.09 and 0.2. So, it seems that the scenario
S (that may happen in the year 2002) is most likely (0.71) to be favorable
for the agriculture field.
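The last steps of Example 3 can be retraced with the sketch below, which is an illustration under our own representation (bba's as dicts from frozensets of classes to masses; hypothetical function names). It works on masses rather than on belief functions: since bel is a linear function of m, averaging the bba's with the weights 0.6 and 0.4 yields the bba of the averaged belief function. The result agrees with the values above up to rounding (the sketch gives about 0.095 for C2, reported as 0.09).

```python
THETA = frozenset({"C1", "C2", "C3"})


def disjunctive(m1, m2):
    """Disjunctive rule: m(A) = sum of m1(B) * m2(C) with B | C = A."""
    out = {}
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b | c
            out[a] = out.get(a, 0.0) + mass_b * mass_c
    return out


def pignistic(m):
    """Share each focal mass equally among its classes (no mass on the empty set)."""
    betp = {c: 0.0 for c in THETA}
    for focal, mass in m.items():
        for c in focal:
            betp[c] += mass / len(focal)
    return betp


# Leaf bba's taken from Examples 1 and 2.
m_s15 = {frozenset({"C1"}): 0.85, THETA: 0.15}
m_s4 = {frozenset({"C3"}): 0.6, frozenset({"C2", "C3"}): 0.2, THETA: 0.2}
m_s6 = {frozenset({"C3"}): 1.0}

m1 = m_s15                                            # first focal element
m2 = disjunctive(disjunctive(m_s15, m_s4), m_s6)      # second focal element

# Weighted average with the masses 0.6 and 0.4 of the two focal elements.
m_final = {}
for weight, m in ((0.6, m1), (0.4, m2)):
    for focal, mass in m.items():
        m_final[focal] = m_final.get(focal, 0.0) + weight * mass

betp = pignistic(m_final)
print({c: round(p, 3) for c, p in sorted(betp.items())})
```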

5 Conclusion 

In this paper, we have presented a method for classifying scenarios using belief
decision trees. Our method has the advantage of handling the uncertainty that may
characterize either the classes of the training scenarios used in the construction
of the belief decision tree or the configurations of the hypotheses making up the
scenario to classify.

The result of the classification of scenarios provides significant help to
decision makers in conceiving their strategic policy.

Evaluation of this belief decision tree approach and comparisons with classical
classification techniques are now our major research interest.

Abstract. Given two sets of strings, consider the problem of finding a
subsequence that is common to one set but never appears in the other
set. The problem is known to be NP-complete. We generalize the problem
to an optimization problem, and give a practical algorithm to solve it
exactly. Our algorithm uses pruning heuristics and subsequence automata,
and can find the best subsequence. We show some experiments that
convinced us that the approach is quite promising.



1 Introduction 

Strings are one of the most fundamental structures for expressing and storing
information. Nowadays, a large amount of string data is available. String processing
has a vast application area, such as Genome Informatics and Internet-related work.
It is quite important to discover useful rules from large text data or sequential
data [1, 6, 9, 22]. Finding a good rule to separate two given sets, often referred to as
positive examples and negative examples, is a critical task in Discovery Science
as well as Machine Learning.

Shimozono et al. [20] developed a machine discovery system BONSAI that
produces a decision tree over regular patterns with alphabet indexing, from given
positive and negative sets of strings. The core part of the system is to generate
a decision tree which classifies positive examples and negative examples as
correctly as possible. For that purpose, we have to find a pattern that maximizes
the goodness according to the entropy information gain measure, recursively at
each node of the tree. In the current implementation, a pattern associated with
each node is restricted to a substring pattern, due to the limit of computation
time. One of the motivations of this study is to extend the BONSAI system to
allow subsequence patterns as well as substring patterns at nodes, and to reduce
the computation time.

However, there is a large gap between the complexity of finding the best
substring pattern and that of finding the best subsequence pattern. Theoretically,
the former problem can be solved in linear time, while the latter is NP-hard.

In this paper, we give a practical solution for finding the best subsequence pattern
that separates a given set of strings from another set of strings. We
propose a practical implementation of an exact search algorithm that
avoids exhaustive search in practice. Since the problem is NP-hard, we are essentially
forced to examine exponentially many candidate patterns in the worst case. Basically,
for each pattern w, we have to count the number of strings that contain w as
a subsequence in each of the two sets. We call the task of counting these numbers
answering a subsequence query. The computational cost of finding the best
subsequence pattern mainly comes from the total amount of time needed to answer these
subsequence queries, since each is a relatively heavy task if the sets are large, and
many queries will be needed. In order to reduce the time, we have to either (1)
ask as few queries as possible, or (2) answer queries faster. We attack
the problem from both directions.

First, we reduce the search space by appropriately pruning redundant
branches that are guaranteed not to contain the best pattern. We use a heuristic
inspired by Morishita and Sese [18], combined with some properties of
subsequence languages.

Next, we accelerate the answering of subsequence queries. Since the sets of
strings are fixed while finding the best subsequence pattern, it is reasonable to
preprocess the sets so that answering a subsequence query for any pattern will
be fast. We take an approach based on a deterministic finite automaton that
accepts all subsequences of a string. Actually, we use subsequence automata for
sets of strings, developed in [11]. A subsequence automaton can quickly answer a
subsequence query, at the cost of the preprocessing time and space required to
construct it.

Since these two approaches are different in their aims, we expect that a
balanced integration of the two would result in the most efficient way to find the
best subsequence patterns. In order to verify the performance of our algorithm,
we are performing some experiments on these two approaches. We report some
results of the experiments, which convinced us that the approach is quite promising.

2 Preliminaries 

Let $\Sigma$ be a finite alphabet, and let $\Sigma^*$ be the set of all strings over $\Sigma$. For a
string w, we denote by |w| the length of w, and for a set S, we denote by |S|
the cardinality of S. We say that a string v is a prefix (substring, suffix, resp.) of
w if w = vy (w = xvy, w = xv, resp.) for some strings $x, y \in \Sigma^*$. We say that
a string v is a subsequence of a string w if v can be obtained by removing zero
or more characters from w, and say that w is a supersequence of v. We denote
by $v \preceq_{str} w$ that v is a substring of w, and by $v \preceq_{seq} w$ that v is a subsequence
of w. For a string v, we define the substring language $L^{str}(v)$ and the subsequence
language $L^{seq}(v)$ as follows:

$L^{str}(v) = \{w \in \Sigma^* \mid v \preceq_{str} w\}$, and

$L^{seq}(v) = \{w \in \Sigma^* \mid v \preceq_{seq} w\}$, respectively.

The following lemma is obvious from the definitions.

Lemma 1. For any strings $v, w \in \Sigma^*$,

1. if v is a prefix of w, then $v \preceq_{str} w$,

2. if v is a suffix of w, then $v \preceq_{str} w$,

3. if $v \preceq_{str} w$ then $v \preceq_{seq} w$,

4. $v \preceq_{str} w$ if and only if $L^{str}(v) \supseteq L^{str}(w)$,

5. $v \preceq_{seq} w$ if and only if $L^{seq}(v) \supseteq L^{seq}(w)$.
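The two relations can be tested with a few lines of Python. This is an illustrative sketch with function names of our choosing; the subsequence test scans w once from left to right, matching the characters of v greedily.

```python
def is_substring(v, w):
    """True if v occurs in w as a contiguous block."""
    return v in w


def is_subsequence(v, w):
    """True if v can be obtained from w by deleting zero or more characters."""
    it = iter(w)
    # 'ch in it' advances the iterator until ch is found (or w is exhausted),
    # so the characters of v must appear in w in the same order.
    return all(ch in it for ch in v)


print(is_substring("bcd", "abcde"), is_subsequence("ace", "abcde"),
      is_subsequence("aec", "abcde"))
```

In line with Lemma 1 (3), every substring is also a subsequence, while the converse fails: "ace" is a subsequence of "abcde" but not a substring of it.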

3 Formulation of the Problem 

Let good be a function from $\Sigma^* \times 2^{\Sigma^*} \times 2^{\Sigma^*}$ to the set of real numbers. We
formulate the problem to be solved as follows.

Definition 1 (Finding the best pattern according to good).

Input: Two sets $S, T \subseteq \Sigma^*$ of strings.

Output: A string $w \in \Sigma^*$ that maximizes the value good(w, S, T).

Intuitively, the value good(w, S, T) expresses how well the rule specified by
a string w distinguishes S from T. The definition of good varies with each
application. For example, the $\chi^2$ value, the entropy information gain, and the gini
index are frequently used (see [18]). Essentially, these statistical measures are
defined by the numbers of strings that satisfy the rule specified by w. In this
paper, we only consider the rules defined as substring languages and subsequence
languages. We call these problems finding the best substring pattern and
finding the best subsequence pattern, respectively. Let L be either $L^{str}$ or $L^{seq}$.
Then any of the above examples of measures can be described in the following
form:

$good(w, S, T) = f(x_w, y_w, |S|, |T|)$, where

$x_w = |S \cap L(w)|$,

$y_w = |T \cap L(w)|$.



For example, the entropy information gain, which was introduced by Quinlan [19]
and is also used in the BONSAI system [20], can be defined in terms of the
function f as follows:

$f(x, y, x_{max}, y_{max}) = \frac{x + y}{x_{max} + y_{max}} I(x, y) + \frac{x_{max} - x + y_{max} - y}{x_{max} + y_{max}} I(x_{max} - x, y_{max} - y)$,

where

$I(s, t) = 0$ (if s = 0 or t = 0),

$I(s, t) = -\frac{s}{s+t} \log \frac{s}{s+t} - \frac{t}{s+t} \log \frac{t}{s+t}$ (otherwise).

When the sets S and T are fixed, the values $x_{max} = |S|$ and $y_{max} = |T|$
become constants. Thus, we abbreviate the function $f(x, y, x_{max}, y_{max})$ to f(x, y)
in the sequel.

Since the function good(w, S, T) expresses the goodness of a string w to
distinguish two sets, it is natural to assume that the function f satisfies
conicality, defined as follows.

Definition 2. We say that a function f(x, y) is conic if

- for any $0 \le y \le y_{max}$, there exists an $x_1$ such that

  • $f(x, y) \ge f(x', y)$ for any $0 \le x < x' \le x_1$, and

  • $f(x, y) \le f(x', y)$ for any $x_1 \le x < x' \le x_{max}$.

- for any $0 \le x \le x_{max}$, there exists a $y_1$ such that

  • $f(x, y) \ge f(x, y')$ for any $0 \le y < y' \le y_1$, and

  • $f(x, y) \le f(x, y')$ for any $y_1 \le y < y' \le y_{max}$.

Actually, all of the above statistical measures are conic. We remark that any
convex function is conic.

Lemma 2. Let f(x, y) be a conic function defined over $[0, x_{max}] \times [0, y_{max}]$. For
any $0 \le x < x' \le x_{max}$ and $0 \le y < y' \le y_{max}$, we have

$f(x, y) \le \max\{f(x', y'), f(x', 0), f(0, y'), f(0, 0)\}$, and

$f(x', y') \le \max\{f(x, y), f(x, y_{max}), f(x_{max}, y), f(x_{max}, y_{max})\}$.

Proof. We show the first inequality only. The second can be proved in the same
way. Since f is conic, we have $f(x, y) \le \max\{f(x, 0), f(x, y')\}$. Moreover, we have
$f(x, 0) \le \max\{f(0, 0), f(x', 0)\}$ and $f(x, y') \le \max\{f(0, y'), f(x', y')\}$. Thus the
inequality holds. □

In the rest of the paper, we assume that any function / associated with the 
objective function good is conic, and can be evaluated in constant time. 

Now we consider the complexity of finding the best substring pattern and
subsequence pattern, respectively. It is not hard to show that finding the best
substring pattern can be solved in polynomial time, since there are only $O(N^2)$
substrings of the given sets of strings, where N is the total length of the strings,
so that we can check all candidates in a trivial way. Moreover, we can solve it in
linear time by using generalized suffix trees [12].

Theorem 1. We can find the best substring pattern in linear time. 

On the other hand, it is not easy to find the best subsequence pattern. First 
we introduce a very closely related problem. 

Definition 3 (Consistency problem for subsequence patterns).

Input: Two sets $S, T \subseteq \Sigma^*$ of strings.

Question: Is there a string w that is a subsequence of each string $s \in S$, but
not a subsequence of any string $t \in T$?

The problem can be interpreted as a special case of finding the best
subsequence pattern. The next theorem shows that the problem is intractable.

Theorem 2 ([13,16,17]). The consistency problem for subsequence patterns 
is NP-complete. 

Therefore, we are essentially forced to enumerate and evaluate exponentially
many subsequence patterns in the worst case, in order to find the best
subsequence pattern. In the next section, we show a practical solution based on
pruning search trees. Our pruning strategy utilizes the property of subsequence
languages and the conicality of the function.

4 Pruning Heuristics

In this section, we introduce two pruning heuristics, inspired by Morishita and
Sese [18], to construct a practical algorithm to find the best subsequence pattern.
For a conic function f(x, y), we define

$F(x, y) = \max\{f(x, y), f(x, 0), f(0, y), f(0, 0)\}$, and

$G(x, y) = \max\{f(x, y), f(x, y_{max}), f(x_{max}, y), f(x_{max}, y_{max})\}$.

Theorem 3. For any strings $v, w \in \Sigma^*$ with $v \preceq_{seq} w$,

$f(x_w, y_w) \le F(x_v, y_v)$,    (1)

$f(x_v, y_v) \le G(x_w, y_w)$.    (2)

Proof. By Lemma 1 (5), $v \preceq_{seq} w$ implies that $L^{seq}(v) \supseteq L^{seq}(w)$. Thus
$x_v = |S \cap L^{seq}(v)| \ge |S \cap L^{seq}(w)| = x_w$. In the same way, we can show
$y_v \ge y_w$. By Lemma 2, we have $f(x_w, y_w) \le F(x_v, y_v)$. The second inequality
can be verified similarly. □

In Fig. 1, we show our algorithm to find the best subsequence pattern from
two given sets of strings, according to the function f. Optionally, we can specify
the maximum length of subsequences. We use the following data structures in
the algorithm.

StringSet Maintain a set S of strings.

- void append(string w) : append a string w into the set S.

- int numOfSubseq(string seq) : return the cardinality of the set
$\{w \in S \mid seq \preceq_{seq} w\}$.

- int numOfSuperseq(string seq) : return the cardinality of the set
$\{w \in S \mid w \preceq_{seq} seq\}$.

PriorityQueue Maintain strings with their priorities.

- bool empty() : return true if the queue is empty.

- void push(string w, double priority) : push a string w into the queue with
priority priority.

- (string, double) pop() : pop and return a pair (string, priority), where
priority is the highest in the queue.
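A naive Python rendering of these two interfaces might look as follows. It is a sketch under our own assumptions: method names are adapted to Python conventions, counting is done by scanning every stored string (no preprocessing), and the priority queue is built on the standard `heapq` module, which is a min-heap, so priorities are stored negated.

```python
import heapq


def is_subsequence(v, w):
    """True if v is a subsequence of w."""
    it = iter(w)
    return all(ch in it for ch in v)


class StringSet:
    """Maintain a set of strings and answer counting queries by linear scan."""

    def __init__(self, strings=()):
        self.strings = list(strings)

    def append(self, w):
        self.strings.append(w)

    def num_of_subseq(self, seq):
        """Number of stored strings w such that seq is a subsequence of w."""
        return sum(is_subsequence(seq, w) for w in self.strings)

    def num_of_superseq(self, seq):
        """Number of stored strings w such that w is a subsequence of seq."""
        return sum(is_subsequence(w, seq) for w in self.strings)


class PriorityQueue:
    """Pop the string with the highest priority first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: insertion order

    def empty(self):
        return not self._heap

    def push(self, w, priority):
        heapq.heappush(self._heap, (-priority, self._count, w))
        self._count += 1

    def pop(self):
        neg_priority, _, w = heapq.heappop(self._heap)
        return w, -neg_priority


s = StringSet(["abc", "acb", "bca"])
q = PriorityQueue()
for word, prio in (("x", 1.0), ("y", 3.0), ("z", 2.0)):
    q.push(word, prio)
first = q.pop()
print(s.num_of_subseq("ab"), s.num_of_superseq("abcb"), first)
```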

The next theorem guarantees the completeness of the algorithm. 

Theorem 4. Let S and T be sets of strings, and $\ell$ be a positive integer. The
algorithm FindMaxSubsequence(S, T, $\ell$) will return a string w that maximizes
the value good(w, S, T) among the strings of length at most $\ell$.

Proof. First of all, we consider the behavior of the algorithm whose lines marked
by * are commented out. That is, we first assume that lines 10, 13 and 20-23
are skipped. In this case, we show that the algorithm performs an exhaustive
search in a breadth-first manner. Since the value of upperBound is unchanged,
PriorityQueue is actually equivalent to a simple queue. Lines 14-16 evaluate
the value good(seq, S, T) of a string seq, and if it exceeds the current maximum
value maxVal, we update maxVal and maxSeq in lines 17-19. Thus the
algorithm will examine all strings of length at most $\ell$, in increasing order of
length, and it can find the maximum.

 1    string FindMaxSubsequence(StringSet S, T, int maxLength = ∞)
 2      string prefix, seq, maxSeq;
 3      double upperBound = ∞, maxVal = -∞, val;
 4      int x, y;
 5      StringSet Forbidden = ∅;
 6      PriorityQueue queue;  /* Best First Search */
 7      queue.push("", ∞);
 8      while not queue.empty() do
 9        (prefix, upperBound) = queue.pop();
10 *      if upperBound < maxVal then break;
11        foreach c ∈ Σ do
12          seq = prefix + c;  /* string concatenation */
13 *        if Forbidden.numOfSuperseq(seq) == 0 then
14            x = S.numOfSubseq(seq);
15            y = T.numOfSubseq(seq);
16            val = f(x, y);
17            if val > maxVal then
18              maxVal = val;
19              maxSeq = seq;
20 *          upperBound = max{f(x, y), f(x, 0), f(0, y), f(0, 0)};
21 *          if upperBound < maxVal then
22 *            Forbidden.append(seq);
23 *          else
24              if |seq| < maxLength then
25                queue.push(seq, upperBound);
26      return maxSeq;

Fig. 1. Algorithm FindMaxSubsequence. In our pseudocode, indentation indicates
block structure, and the break statement jumps out of the closest enclosing loop.

We now consider lines 20, 21, and 23. Let v be the string currently
represented by the variable seq. At lines 14 and 15, $x_v$ and $y_v$ are computed. At
line 20, upperBound = $F(x_v, y_v)$ is estimated, and if upperBound is less than the
current maximum value maxVal, the algorithm skips pushing v into the queue.
It means that any string w of which v is a prefix will not be evaluated. We can
show that such a string w can never be the best subsequence as follows. Since v
is a prefix of w, we know v is a subsequence of w, by Lemma 1 (1) and (3). By
Theorem 3 (1), the value $f(x_w, y_w) \le F(x_v, y_v)$, and since $F(x_v, y_v) < maxVal$,
the string w can never be the maximum.

Assume the condition upperBound < maxVal holds at line 10. It implies that
any string v in the queue can never be the best subsequence, since the queue
is a priority queue so that $F(x_v, y_v) \le upperBound$, which means $f(x_v, y_v) \le
F(x_v, y_v)$ by Theorem 3 (1). Therefore $f(x_v, y_v) < maxVal$ for any string v in
the queue, and we can jump out of the loop immediately.

Finally, we take account of lines 13 and 22. Initially, the set Forbidden 
of strings is empty. At line 22, a string v is appended to Forbidden only if 
upperBound = F(x_v, y_v) < maxVal. At line 13, if the condition 
Forbidden.numOfSuperseq(seq) == 0 

does not hold, seq will not be evaluated. Moreover, any string of which seq is 
a prefix will not be evaluated either, since we do not push seq into the queue 
at line 25 in this case. Nevertheless, we can show that these cuts never affect 
the final output as follows. Assume that Forbidden.numOfSuperseq(seq) ≠ 0 for 
a string seq. It implies that there exists a string u ∈ Forbidden such that seq 
is a supersequence of u. In other words, u is a subsequence of seq. Since u is 
in Forbidden, we know that F(x_u, y_u) < maxVal at some moment. By Theorem 3 (2), 
the value f(x_seq, y_seq) can never exceed maxVal. Thus the output of 
the algorithm is not changed by these cuts. □ 
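
The first pruning rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of Fig. 1: the objective `score` (difference of match ratios) and the bound `upper_bound` (maximum of the objective over the four corners of the rectangle [0, x] × [0, y], which is valid for any convex objective because extending a string can only decrease x and y) are hypothetical stand-ins for the functions f and F, and a plain queue replaces the priority queue. The second heuristic (the Forbidden set) is omitted.

```python
from collections import deque

def is_subsequence(v, w):
    """Return True if v can be obtained from w by deleting characters."""
    it = iter(w)
    return all(ch in it for ch in v)

def count_matches(strings, v):
    """Number of strings that contain v as a subsequence."""
    return sum(is_subsequence(v, w) for w in strings)

def score(x, y, x_max, y_max):
    # Hypothetical objective (an assumption): difference of match ratios.
    return abs(x / x_max - y / y_max)

def upper_bound(x, y, x_max, y_max):
    # Assumed bound: a convex score on [0, x] x [0, y] is maximal at a corner.
    return max(score(a, b, x_max, y_max) for a in (0, x) for b in (0, y))

def find_max_subsequence(S, T, alphabet, max_length, prune=True):
    """Breadth-first search over candidate strings with the first pruning rule."""
    best_val, best_seq, evaluated = -1.0, None, 0
    queue = deque([""])
    while queue:
        v = queue.popleft()
        x, y = count_matches(S, v), count_matches(T, v)
        evaluated += 1
        val = score(x, y, len(S), len(T))
        if val > best_val:
            best_val, best_seq = val, v
        if len(v) >= max_length:
            continue
        if prune and upper_bound(x, y, len(S), len(T)) < best_val:
            continue  # no extension of v can beat the current maximum
        queue.extend(v + c for c in alphabet)
    return best_seq, best_val, evaluated
```

Calling `find_max_subsequence` with `prune=False` gives the exhaustive search, so the two settings can be compared on the same input: the best value must agree, while the number of evaluated strings can only shrink.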

By the above theorem, we can safely prune the branches. We now consider 
the cost of performing these heuristics. The cost of the first heuristic at lines 
20, 21, and 23 is negligible, since evaluating upperBound at line 20 is negligible 
compared with evaluating x and y at lines 14 and 15. On the other hand, 
the second heuristic at lines 13 and 22 may be expensive, since the evaluation 
of Forbidden.numOfSuperseq(seq) may not be so easy when the set Forbidden 
becomes large. 

In any case, one of the most time-consuming parts of the algorithm is lines 14 
and 15. Here, for a string seq, we have to count the number of strings in the sets 
S and T of which seq is a subsequence. We remark that the sets S and T are fixed 
within the algorithm FindMaxSubsequence. Thus it is possible to speed 
up the counting, at the cost of some appropriate preprocessing. We discuss this in 
the next section. 

5 Using Subsequence Automata 

In this section, we turn our attention to the following problem. 

Definition 4 (Counting the matched strings). 

Input A finite set S ⊆ Σ* of strings. 

Query A string seq ∈ Σ*. 

Answer The cardinality of the set S ∩ L^seq(seq). 
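
As a baseline, the query of Definition 4 can be answered without any preprocessing by testing each string in turn. The following Python sketch shows this trivial method; the function names are ours and chosen only for illustration.

```python
def is_subsequence(seq, w):
    """Return True if seq can be obtained from w by deleting characters."""
    it = iter(w)
    # `ch in it` consumes the iterator up to the first match, so the
    # characters of seq are matched from left to right in w.
    return all(ch in it for ch in seq)

def count_matched(S, seq):
    """Number of strings in S that have seq as a subsequence."""
    return sum(1 for w in S if is_subsequence(seq, w))
```

Each query costs time proportional to the total length of the strings in S, which is what the automata discussed in this section are meant to improve.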

Of course, each query should be answered very fast, since many queries 
will arise. Thus, we should preprocess the input in order to answer the query 
quickly. On the other hand, the preprocessing time is also a critical factor in 
our application. In this paper, we utilize automata that accept subsequences 
of strings. Baeza-Yates [5] introduced the directed acyclic subsequence graph 
(DASG) of a string t as the smallest deterministic partial finite automaton that 
recognizes all possible subsequences of t. By using the DASG of t, we can determine 
whether a string s is a subsequence of a string t in O(|s|) time. He showed a 
right-to-left algorithm for building the DASG for a single string. On the other 
hand, Tronicek and Melichar [21] showed a left-to-right algorithm for building 
the DASG for a single string. 
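
The idea behind the single-string case can be sketched as a table of next occurrences filled in one right-to-left pass. The Python code below is an assumption-laden illustration, not the construction of [5] or [21]: it keeps one state per position of t and performs no minimization, so it need not be the smallest automaton.

```python
def build_dasg(t):
    """nxt[i][c] = state reached from state i on c, i.e. one past the first
    occurrence of c in t[i:]. States are 0..len(t); all are accepting."""
    n = len(t)
    nxt = [dict() for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])  # inherit the transitions of the next state
        nxt[i][t[i]] = i + 1       # the nearest occurrence of t[i] wins
    return nxt

def accepts(nxt, s):
    """Decide whether s is a subsequence of t in O(|s|) dictionary lookups."""
    state = 0
    for ch in s:
        state = nxt[state].get(ch)
        if state is None:
            return False
    return True
```

Copying the transition dictionary at every position costs time proportional to |t| times the alphabet size, which is acceptable for a sketch with a small alphabet.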

We now turn our attention to the case of a set S of strings. A straightforward 
approach is to build a DASG for each string in S. Given a query string seq, we 
traverse all DASGs simultaneously, and return the total number of DASGs that 
accept seq. It clearly runs in O(k|seq|) time, where k is the number of strings in 
S. When the running time is more critical, we can build a product of the k DASGs 
so that the running time becomes O(|seq|), at the cost of preprocessing 
time and space. This is the DASG for a set of strings. 

Baeza-Yates also presented a right-to-left algorithm for building the DASG 
for a set of strings [5]. Moreover, Tronicek and Melichar [21], and Crochemore 
and Tronicek [7] showed left-to-right algorithms for building the DASG for a set 
of strings. 

In [11], we considered a subsequence automaton as a deterministic complete 
finite automaton that recognizes all possible subsequences of a set of strings, 
which is essentially the same as the DASG. We showed an on-line construction of 
the subsequence automaton for a set of strings. Our algorithm runs in O(|Σ|(m + 
k) + N) time using O(|Σ|m) space, where |Σ| is the size of the alphabet, N is the 
total length of the strings, and m is the number of states of the resulting subsequence 
automaton. This is the fastest algorithm to construct a subsequence automaton 
for a set of strings, to the best of our knowledge. We can extend the automaton 
in a natural way so that it answers the above Counting the matched strings problem 
(see Fig. 2). 

Although the construction time is linear in the size m of the automaton to be 
built, unfortunately m = O(n^k) in general, where we assume that the set S 
consists of k strings of length n. (The lower bound of m is only known for the 
case k = 2, as m = Ω(n^2) [7].) Thus, when the construction time is also a critical 
factor, as in our application, it may not be a good idea to construct a subsequence 
automaton for the set S itself. Here, for a specified parameter mode > 0, we 
partition the set S into d = k/mode subsets S_1, S_2, ..., S_d of at most mode 
strings, and construct d subsequence automata, one for each S_i. When asking a query 
seq, we have only to traverse all automata simultaneously, and return the sum 
of the answers. In this way, we can balance the preprocessing time with the total 
time to answer (possibly many) queries. In the next section, we experimentally 
evaluate the optimal value of the parameter mode in some situations. 
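
The trade-off controlled by mode can be sketched as follows. This Python code is a simplified illustration under our own assumptions, not the on-line construction of [11]: each automaton is built as the reachable part of a product of per-string matchers, equivalent states are not merged, and every state stores the number of strings that still match. All function names are hypothetical.

```python
from collections import deque

def step(w, p, c):
    """Position in w after matching c from position p, or -1 if impossible."""
    if p < 0:
        return -1
    q = w.find(c, p)
    return q + 1 if q >= 0 else -1

def build_automaton(strings, alphabet):
    """Product-style subsequence automaton; each state stores a match count."""
    start = tuple(0 for _ in strings)
    ids = {start: 0}
    delta = [{}]            # delta[state][char] -> state
    count = [len(strings)]  # number of strings still matching in each state
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for c in alphabet:
            nxt = tuple(step(w, p, c) for w, p in zip(strings, state))
            alive = sum(p >= 0 for p in nxt)
            if alive == 0:
                continue    # partial automaton: no transition is stored
            if nxt not in ids:
                ids[nxt] = len(delta)
                delta.append({})
                count.append(alive)
                queue.append(nxt)
            delta[ids[state]][c] = ids[nxt]
    return delta, count

def query(automaton, seq):
    """Follow seq through one automaton and read off the stored count."""
    delta, count = automaton
    state = 0
    for c in seq:
        if c not in delta[state]:
            return 0
        state = delta[state][c]
    return count[state]

def preprocess(S, alphabet, mode):
    """Split S into groups of at most `mode` strings, one automaton each."""
    return [build_automaton(S[i:i + mode], alphabet)
            for i in range(0, len(S), mode)]

def count_matched(automata, seq):
    """Answer the counting query as the sum over all automata."""
    return sum(query(a, seq) for a in automata)
```

With mode equal to the number of strings a single automaton answers the query in one traversal, while mode = 1 degenerates to one small automaton per string; the answers are identical and only the preprocessing and query costs change.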
Fig. 2. Subsequence automaton for S = {abab, abb, bb}, where Σ = {a, b}. Each number 
on a state denotes the number of matched strings. For example, by traversing the states 
according to the string ab, we reach the state whose number is 2. It corresponds to the 
cardinality |L^seq(ab) ∩ S| = 2, since ab ≤_seq abab, ab ≤_seq abb and ab ≰_seq bb. 



6 Implementation and Experiments 

In this section, we report some results of our experiments. We are implementing 
our algorithm in Fig. 1 in C++ with the Standard Template Library 
(STL). For the PriorityQueue, we use the standard priority_queue in STL. 
For the StringSet, we have implemented the function numOfSubseq(seq) 
in the following two ways depending on the value of mode. In the case 
mode = 0, we do not use subsequence automata. For each string w in the set, we 
check whether seq is a subsequence of w or not in a trivial way, and return the 
number of matched strings. Thus we do not need to preprocess the set. For the 
cases mode ≥ 1, we construct k/mode subsequence automata in the preprocessing, 
where k is the number of strings in the set. On the other hand, the function 
numOfSuperseq(seq) is implemented in a trivial way without using any special 
data structure. 

We examined the following two data sets as input. 

Transmembrane Amino acid sequences taken from the PIR database, which 
are converted into strings over the binary alphabet Σ = {0, 1}, according to 
the alphabet indexing discovered by BONSAI [20]. The average length of 
the strings is about 30. S_1 consists of 70 transmembrane domains, and T_1 
consists of 100 non-transmembrane domains. 

DNA DNA sequences of the yeast genome over Σ = {A, T, G, C}. The lengths of the 
strings are all 30. We selected two sets S_2 and T_2 based on the functional 
categories. |S_2| = 31 and |T_2| = 35. 

We note that (S_1, T_1) is an easy instance, while (S_2, T_2) is a hard instance, 
in the sense that the best score for (S_1, T_1) is high, while that for (S_2, T_2) is low. 
As we will report, this fact affects the practical behavior of our algorithm. 

In order to verify the effect of the first and second heuristics, 
we compared the search time that our algorithm needs to find the best 
subsequence pattern in three settings (pruning1, pruning2 and exhaustive). 
Fig. 3. Number of strings actually evaluated and running time, where maxLength 
varies. Panels: (a) number (Transmembrane), (b) time (Transmembrane), (c) number 
(DNA), (d) time (DNA). Curves: pruning1, pruning2, exhaustive. 



pruning1 We use the first heuristic only, by commenting out lines 13 and 
22. 

pruning2 We use both the first and second heuristics. 

exhaustive We do not use any heuristics, by commenting out lines 10, 13 
and 20-23. 

Our experiments were carried out both on a workstation AlphaServer DS20 
with an Alpha 21264 processor at 500MHz running Tru64 UNIX operating sys- 
tem (WS), and on a personal computer with Pentium III processor at 733MHz 
running Linux (PC). 

First we verified the effect of the first and second heuristics. 
Fig. 3 shows the number of strings actually evaluated and the running time 
on the PC, when maxLength varies and mode is fixed to 0. Graphs (a) 
and (c) show that pruning2 gives the most effective pruning with respect to 
the number of evaluated strings, as we expected. For example, pruning2 reduces 
the search space to approximately half of that of pruning1 when maxLength is 
14 in (c). However, the running time behaves differently from what we expected. 
Graph (b) shows that the running time reflects the number of evaluated strings, 
while graph (d) shows that pruning2 was much slower than pruning1. This is 
because of the overhead of maintaining the set Forbidden and the response time of 
the query to Forbidden, since we implemented it in a trivial way. By comparing 
(a) and (b) with (c) and (d) respectively, we see that the instance (S_1, T_1) of 
Transmembrane is easy to solve compared to (S_2, T_2) of DNA, because some 
short subsequences with high scores were found at an early stage so that the 
search space was reduced drastically. 

Table 1. Preprocessing time and search time (seconds) on the PC. The data is Transmembrane. 

mode           0      1      2      3      4      5      6      7      8      9      10 
preprocessing  -      0.023  0.054  0.120  0.273  0.470  0.796  1.378  2.108  3.083  4.543 
exhaustive     1.502  1.560  0.906  0.710  0.599  0.535  0.494  0.460  0.425  0.414  0.379 
pruning1       0.067  0.077  0.046  0.037  0.031  0.025  0.023  0.022  0.020  0.019  0.018 
pruning2       0.060  0.069  0.047  0.040  0.035  0.033  0.031  0.030  0.029  0.029  0.028 

We now verify the effect of introducing subsequence automata. Table 1 shows 
the preprocessing time and the search time for each search method, where mode is 
changed from 0 to 10. We can see that the preprocessing time increases with 
mode, as we expected, since the total size of the automata increases. On 
the other hand, the search time decreases monotonically with mode for any 
search method except in the case mode = 0, since each subsequence query is 
answered quickly by using subsequence automata. The search time in the case 
mode = 1 is slightly longer than that in the case mode = 0. It implies that 
traversing an automaton is not much faster than naive subsequence matching 
when answering subsequence queries. We suppose that this phenomenon arises 
mainly from the effect of CPU caches. 

To see the value of mode at which the total running 
time is minimized, refer to Fig. 4 (a), (b), and (c), which illustrate Table 1. The 
total running time, that is, the sum of the preprocessing and search times, is minimized 
at mode = 3 for exhaustive search (a). On the other hand, unfortunately, 
for both pruning1 in (b) and pruning2 in (c), the total running time is minimized 
at mode = 0. It means that in this case, subsequence automata could not reduce 
the running time. In particular, on the workstation (d), search without using subsequence 
automata (mode = 0) is much faster than with any other mode. We guess 
that this is also caused by the CPU caches. 
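
The minimizing values of mode can be checked directly from the figures in Table 1. The short Python sketch below adds the preprocessing time to each search time and reports the mode with the smallest total; the preprocessing time for mode = 0 is taken as 0, since no automaton is built in that case.

```python
MODES = list(range(11))

# Values copied from Table 1 (seconds, PC, Transmembrane data).
PREPROCESSING = [0.0, 0.023, 0.054, 0.120, 0.273, 0.470,
                 0.796, 1.378, 2.108, 3.083, 4.543]
SEARCH = {
    "exhaustive": [1.502, 1.560, 0.906, 0.710, 0.599, 0.535,
                   0.494, 0.460, 0.425, 0.414, 0.379],
    "pruning1":   [0.067, 0.077, 0.046, 0.037, 0.031, 0.025,
                   0.023, 0.022, 0.020, 0.019, 0.018],
    "pruning2":   [0.060, 0.069, 0.047, 0.040, 0.035, 0.033,
                   0.031, 0.030, 0.029, 0.029, 0.028],
}

def total_times(method):
    """Total running time (preprocessing plus search) for every mode."""
    return [p + s for p, s in zip(PREPROCESSING, SEARCH[method])]

def best_mode(method):
    """Mode that minimizes the total running time for the given method."""
    totals = total_times(method)
    return min(MODES, key=lambda m: totals[m])
```

For the exhaustive search the total drops from 1.502 s at mode = 0 to about 0.830 s at mode = 3 and then rises again, whereas for both pruning variants the search is already so short that any preprocessing outweighs the gain.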

From these results, we verified that the pruning heuristics and subsequence 
automata each reduce the time to find the best subsequence pattern, independently. 
Fig. 4. Total running time (preprocessing time plus search time, in seconds, plotted 
against mode) of (a) exhaustive search, (b) pruning1, (c) pruning2 and (d) pruning2. 
The experiments (a), (b) and (c) were performed on the PC, while (d) was performed 
on the WS. 



7 Concluding Remarks 

We have discussed how to find a subsequence that maximally distinguishes two 
given sets of strings, according to a specified objective function. The only requirement 
on the objective function is conicality, which is weaker than convexity, 
and almost all natural measures to distinguish two sets satisfy this property. 

In this paper, we focused on finding the best subsequence pattern. However, 
we can easily extend our algorithm to enumerate all strings whose values of the 
objective function exceed a given threshold, since essentially we examine all 
strings, with effective pruning heuristics. Enumeration may be preferable 
in the context of text data mining [6,9,22]. 




A Practical Algorithm to Find the Best Subsequence Patterns 






In our current implementation, the function numOfSuperseq is realized in a 
trivial way, which slows down pruning2 in some situations. If we can construct 
supersequence automata efficiently, the second heuristic will be more effective. 

We remark that the function G in Theorem 3 (2) is not actually used in our 
algorithm, since our algorithm starts from the empty string and tries to extend 
it. Another approach is also possible, which starts from a given string and tries 
to shrink it. In this case, the function G will be applicable. 

In [8, 15] episode matching is considered, where the total length of the 
matched strings is bounded by a given parameter. It will be very interesting to 
extend our approach to find the best episode to distinguish two sets of strings. 
Moreover, it is also challenging to apply our approach to find the best pattern in 
the sense of pattern languages introduced by Angluin [2], where the related consistency 
problems are shown to be very hard [13, 14, 17]. Arimura et al. showed 
another approach to find the best proximity pattern [3, 4, 10]. It may be interesting 
to combine these approaches into one. 

We plan to install our algorithm into the core of the decision tree generator 
in the BONSAI system [20]. 
Abstract. In modeling various signals such as the speech signal by using 
the Hidden Markov Model (HMM), it is often required to adapt not only 
to the inherent nonstationarity of the signal, but also to changes of the sources 
(speakers) that yield the signal. The well-known Baum-Welch algorithm 
tries to adjust an HMM so as to optimize the fit between the model and 
the observed signal. In this paper we develop an algorithm, which we 
call the on-line Baum-Welch algorithm, by incorporating a learning 
rate into the off-line Baum-Welch algorithm. The algorithm proceeds in 
a series of trials. In each trial the algorithm somehow produces an HMM 
M_t, then receives a symbol sequence w_t, incurring the loss -ln Pr(w_t | M_t), 
which is the negative log-likelihood of the HMM M_t evaluated at w_t. 
The performance of the algorithm is measured by the additional total 
loss, called the regret, of the algorithm over the total loss of 
a standard algorithm, where the standard algorithm is taken to be a 
criterion for measuring the relative loss. We take the off-line Baum-Welch 
algorithm as such a standard algorithm. To evaluate the performance of 
our algorithm, we compare it with the Gradient Descent algorithm. Our experiments 
show that the on-line Baum-Welch algorithm performs well compared 
to the Gradient Descent algorithm. We carry out the experiments not 
only on artificial data, but also on some reasonably realistic data 
made by transforming acoustic waveforms into symbol sequences through 
the vector quantization method. The results show that the on-line Baum-Welch 
algorithm adapts to the change of speakers very well. 



1 Introduction 

The Hidden Markov Model (HMM, for short), which can be viewed as a stochastic 
signal model that produces sequences of symbols, has been extensively used to 
model sequences of observations in various research fields such as speech 
recognition and computational biology. In applications it is often required to adjust 
the model parameters of an HMM to optimize the fit between the model and 
the signal sequences, where the fitness is measured by the likelihood that the 
resultant HMM assigns to the sequences. More generally this is formulated as 
the approximate maximum likelihood model (MLM) problem for HMMs: Find 
an HMM with a given topology that assigns a likelihood to the input sequences 
not much smaller than the maximum likelihood assigned by an optimal HMM. 
Abe and Warmuth [1] showed that although the approximate MLM problem 
is equivalent to the PAC learning problem for HMMs, they are NP-hard if the 
alphabet size is not bounded by a constant¹. Practically, the EM (expectation-maximization) 
technique known as the Baum-Welch algorithm [7] is widely used 
to seek an HMM with sufficiently large likelihood. The Baum-Welch algorithm 
is a method of reestimating the parameters of the HMM so as to increase 
the likelihood of a set of symbol sequences being observed. Given M_1 
as an initial HMM and a set of symbol sequences w^T = (w_1, ..., w_T), the 
Baum-Welch algorithm produces a new HMM M such that the likelihood for 
w^T never decreases as the model changes, i.e., Pr(w^T | M_1) ≤ Pr(w^T | M). Here 
each w_t = (w_{t,1}, ..., w_{t,l_t}) is a sequence of symbols of a certain length l_t, and 
we assume that Pr(w^T | M) = ∏_{t=1}^T Pr(w_t | M), that is, the likelihood for each 
sequence is independently evaluated. Taking the HMM obtained as the revised 
initial HMM and rerunning the Baum-Welch algorithm for it, we can perform 
the Baum-Welch algorithm repeatedly until we get an HMM M such that the 
likelihood Pr(w^T | M) converges to a maximal value. It should be noticed that 
since Pr(w^T | M) might be trapped in a local maximum, the HMM M obtained 
in this way is not guaranteed to maximize the likelihood Pr(w^T | M). In what 
follows, we call an individual symbol sequence w_t an instance. 

In this paper we consider the approximate MLM problem in the on-line 
framework. Namely, the algorithm is presented with instances (symbol sequences) 
one at a time, and each time an instance is given, the algorithm must produce 
an HMM in an attempt to make the likelihood for the next instance large. 
Our work is motivated by applications such as speaker adaptation, where 
instances are generated by a signal source (speaker) that may change with time. 
So, our objective is not to get a single HMM M as the final hypothesis that 
assigns high likelihood to all the instances (i.e., to make Pr(w^T | M) large), 
but to get an HMM M_t for each time t that predicts the next instance w_t with 
high probability (i.e., to make Pr(w_t | M_t) large). 

More precisely, the protocol proceeds in trials as follows. In each trial t, the 
algorithm somehow produces an HMM M_t and then receives an instance w_t. In 
this trial the algorithm incurs a loss defined as 

    L(w_t, M_t) = -\ln \Pr(w_t \mid M_t), 

which is the negative log-likelihood that M_t assigns to w_t. The total loss of the 
algorithm for the whole sequence of instances is given by ∑_{t=1}^T L(w_t, M_t). 

First consider the following regret or relative loss 

    \sum_{t=1}^{T} L(w_t, M_t) - \inf_{M^*} \sum_{t=1}^{T} L(w_t, M^*). 

This is the additional total on-line loss of the algorithm over the total loss of 
the best HMM M*. Our first goal is to minimize the regret no 
matter what instance sequence w^T is given. As the regret gets smaller, the 
algorithm is more likely to have performed as well as the best off-line (batch) 
algorithm. (Later we will give another notion of regret called the adaptive-regret 
that captures the adaptiveness of algorithms more effectively.) Note that the best 
HMM M* is the maximum likelihood model for w^T because ∑_{t=1}^T L(w_t, M*) = 
-∑_{t=1}^T ln Pr(w_t | M*) = -ln Pr(w^T | M*). 

¹ More exactly, the results are proved for the class of probabilistic automata (PAs) 
which are closely related to HMMs. In particular, the hardness result for PAs implies 
that the approximate MLM problem for HMMs with some topology is NP-hard. 

There are a number of related works on the on-line estimation of 
HMM parameters [6,4]. The protocol used there is slightly different from ours 
in that symbols (not sequences) are presented on-line, resulting in a single long 
sequence. Namely, in each trial t after producing an HMM M_t, the algorithm 
observes a symbol w_t rather than a symbol sequence. Note that our protocol 
can be seen as a special case of theirs, because w^T can be regarded as a single 
sequence if a “reset” symbol is assumed to be inserted between instances to 
form the single sequence. In other words, our protocol gives the algorithm the 
additional information on the times when the model should be reset. Obviously, 
for applications to speech recognition, our protocol is more appropriate because 
the breaths in speech play the role of the reset symbol. Moreover, in the usual 
setting, a target HMM M_0 is assumed to exist, which generates the sequence 
w = (w_1, ..., w_T), and the performance of the algorithm is measured in terms 
of the speed of convergence, i.e., how fast the hypotheses M_t converge to the 
target M_0. This contrasts with our framework, in which we do not need to assume 
any target HMM and the performance of the algorithm is measured in terms of 
the regret. 

There is a large body of work on proving regret bounds that has its root in 
the Minimum Description Length community [10,14,11,12,15,16]. So far the 
results developed in this area apply only to the case where the model class 
of probability density functions is simple, say, to the exponential families². In 
most cases the regrets are of the form O(ln T). For a related problem 
for linear regression, a simple algorithm was developed and its regret was shown 
to be O(ln T) [13,2]. Cover and Ordentlich considered a more complex problem 
of universal portfolios and showed that again O(ln T) regret can be achieved [9]. 
So we conjecture that for the on-line MLM problem of HMMs there exists an 
algorithm that achieves O(ln T) regret. However, as suggested in [1], such an 
algorithm should be computationally infeasible (if it exists) because even in the 
off-line setting, obtaining a model that approximates the best model M* is 
NP-hard. 

So we have to give up developing an efficient on-line algorithm that is competitive 
with the best off-line algorithm. Instead we let the on-line algorithm 
compete with the Baum-Welch algorithm used in off-line fashion. Namely, we 
consider the following regret 

    \sum_{t=1}^{T} L(w_t, M_t) - \sum_{t=1}^{T} L(w_t, M_B),    (1) 

where M_B is the HMM obtained by running the Baum-Welch algorithm on w^T 
until Pr(w^T | M_B) converges. We call M_B the limit model, starting from the initial 
model M_1. Now we restate our first goal as follows: Minimize the regret of (1). 

² The exponential families include many fundamental classes of distributions such as 
Bernoulli, Binomial, Poisson, Gaussian, Gamma and so on. 

In this paper we give a simple heuristic on-line algorithm based on the Baum-Welch 
algorithm. More specifically, in each trial t, we first get the HMM M' by 
feeding M_t and w_t to the Baum-Welch algorithm. Then we merge the two models 
M' and M_t with some mixture rate 0 < η_t < 1 to obtain a new model M_{t+1}. 
Note that the mixture rate η_t may depend on t. Actually, it is observed from 
our experiments that the choice of rating scheme influences the performance of 
the algorithm very much. Intuitively, a smaller η_t makes the algorithm more 
conservative, while a larger η_t makes the algorithm more adaptive to recent 
instances. We call this algorithm the on-line Baum-Welch algorithm. 

To see the performance of this algorithm, we did some experiments with 
artificially created data and compared the results with another on-line algorithm 
based on the Gradient Descent (GD) method [3]. The results show that the on-line 
Baum-Welch algorithm with an appropriate rating scheme is slightly better than 
the GD algorithm. However, it turns out that the rating scheme that gives small 
regret strongly depends on the data we deal with. 

One of the redeeming features of on-line algorithms is that they naturally 
adapt to changes of the environment. We examined the adaptability of the two 
algorithms using instance sequences formed by concatenating several sequences 
that are generated by different sources, which are artificially designed HMMs. 
Surprisingly, the regret was negative, which means that the on-line algorithms 
perform well compared to the off-line Baum-Welch algorithm. The reason 
why this happens is that the on-line algorithms tend to follow the change of 
sources, whereas the limit model M_B reflects the source structure that is 
common to a variety of sequences in w^T. To measure the adaptiveness of on-line 
algorithms, we introduce the notion of adaptive-regret. Suppose that all T 
trials are partitioned into p intervals T_1, ..., T_p with {1, ..., T} = T_1 ∪ ... ∪ T_p 
such that the instances W_i = {w_t | t ∈ T_i} in the i-th interval are assumed to be 
generated by the i-th source. Let M_B^i be the limit model obtained by applying 
the off-line Baum-Welch algorithm to the instances W_i in the i-th interval. Then 
the adaptive-regret is defined as 

    \sum_{t=1}^{T} L(w_t, M_t) - \sum_{i=1}^{p} \sum_{t \in T_i} L(w_t, M_B^i) 
        = \sum_{i=1}^{p} \left( \sum_{t \in T_i} L(w_t, M_t) - \sum_{t \in T_i} L(w_t, M_B^i) \right). 

Namely, to get small adaptive-regret, in every i-th interval the algorithm must 
perform as well as M_B^i without knowing the change of intervals. Our second goal 
is to minimize the adaptive-regret. We test the on-line algorithms on symbol 
sequences extracted from real speech data pronounced by several speakers. Our 
algorithms seem to adapt to the change of speakers very well. 
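
The two notions of regret differ only in the reference model used in each trial, which a small Python sketch makes explicit. The function names and the probabilities in the usage note are made up for illustration; the inputs are the likelihoods that the respective models assign to each instance.

```python
import math

def loss(prob):
    """Negative log-likelihood L(w_t, M) = -ln Pr(w_t | M)."""
    return -math.log(prob)

def regret(online_probs, reference_probs):
    """Total on-line loss minus the total loss of one reference model.

    online_probs[t]    = Pr(w_t | M_t)
    reference_probs[t] = Pr(w_t | M_B)
    """
    return (sum(loss(p) for p in online_probs)
            - sum(loss(p) for p in reference_probs))

def adaptive_regret(online_probs, limit_probs, intervals):
    """Sum over the intervals of the regret against that interval's limit model.

    limit_probs[t] = Pr(w_t | M_B^i), where i is the interval containing t
    intervals      = lists of trial indices that partition range(T)
    """
    total = 0.0
    for interval in intervals:
        total += sum(loss(online_probs[t]) - loss(limit_probs[t])
                     for t in interval)
    return total
```

For example, with the hypothetical values `online_probs = [0.5, 0.25]` and `reference_probs = [0.25, 0.25]`, the regret equals -ln 2, so a negative regret simply means that the on-line hypotheses assigned higher likelihood than the reference model.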
2 The Baum-Welch algorithm 



The integers 1, ..., N denote the states of an HMM. In particular, state 1 
is assumed to denote the initial state. The integers 1, ..., K denote observation 
symbols as well. An HMM M is specified by a pair (A, B) of matrices. 
The first matrix A is of dimension [N, N] and its (i, j) component a_{i,j} 
denotes the transition probability of moving from state i to state j. The second 
matrix B is of dimension [N, K] and its (j, k) component b_{j,k} denotes 
the probability of symbol k being output provided the current state is j. Clearly 
∑_{j=1}^N a_{i,j} = 1 and ∑_{k=1}^K b_{i,k} = 1 must hold for every 1 ≤ i ≤ N. An HMM 
M = (A, B) generates a sequence of symbols. In what follows we assume that 
an HMM makes state transitions a certain number of times, denoted l. The HMM M 
performs l state transitions as follows: Starting in the initial state 1, it iterates 
the following for m = 1, ..., l. Suppose that the current state is s_{m-1} = i 
(note that s_0 = 1). Then it chooses state j with probability a_{i,j} and moves 
to s_m = j. After arriving at state j, it chooses symbol k with probability b_{j,k} 
and outputs the symbol w_m = k. Let the sequence obtained be denoted by 
w = (w_1, ..., w_l) ∈ {1, ..., K}*. The probability that the HMM M generates 
the sequence w on a particular sequence of states s = s_1, ..., s_l is given as 
Pr(w, s | M) = ∏_{m=1}^l a_{s_{m-1},s_m} b_{s_m,w_m}. Hence the probability (likelihood) that M 
generates w is given as Pr(w | M) = ∑_{s ∈ {1,...,N}^l} Pr(w, s | M). We call a sequence 
w an instance. 
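
The sum over all state sequences has exponentially many terms, but it can be evaluated with the standard forward recursion. The Python sketch below follows the generation process described above (start in the initial state, move, then emit at the state just entered); states and symbols are numbered from 0 instead of 1, and the brute-force version is included only to check the recursion on tiny inputs.

```python
from itertools import product

def likelihood(A, B, w):
    """Pr(w | M) for M = (A, B), computed by the forward recursion."""
    N = len(A)
    # alpha[j] = Pr(w_1..w_m, s_m = j); the chain starts in state 0.
    alpha = [1.0] + [0.0] * (N - 1)
    for k in w:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][k]
                 for j in range(N)]
    return sum(alpha)

def likelihood_brute_force(A, B, w):
    """Same quantity as an explicit sum over all N^l state sequences."""
    N = len(A)
    total = 0.0
    for states in product(range(N), repeat=len(w)):
        p, prev = 1.0, 0
        for s, k in zip(states, w):
            p *= A[prev][s] * B[s][k]
            prev = s
        total += p
    return total
```

The recursion needs on the order of l·N² multiplications for an instance of length l, compared with N^l terms for the explicit sum.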

We are now ready to give the Baum-Welch algorithm. When given an initial 
HMM M_1 together with a sequence of instances w^T = (w_1, ..., w_T), the 
Baum-Welch algorithm tries to find the HMM M that maximizes the concave 
function Pr(w^T | M_1) log Pr(w^T | M) rather than Pr(w^T | M). This is a standard 
EM (Expectation-Maximization) technique (see, e.g., [8]) that guarantees that 
Pr(w^T | M_1) ≤ Pr(w^T | M) holds. Solving this maximization problem we can derive 
the algorithm to update the parameters. Let #(i → j) and #(k | j) be the 
random variables that denote the number of transitions from state i to state j 
and the number of times symbol k is output at state j, respectively, in a sequence of 
state transitions we consider. We compute the expected values of these variables 
given that the sequence w^T is observed. Here the expectation is taken 
with respect to the probability space specified by M_1. More precisely, for all 
1 ≤ i, j ≤ N and 1 ≤ k ≤ K, we compute 

    n_{i,j} = \frac{1}{T} \sum_{t=1}^{T} E(\#(i \to j) \mid w_t, M_1) 

and 

    m_{j,k} = \frac{1}{T} \sum_{t=1}^{T} E(\#(k \mid j) \mid w_t, M_1), 

where E( ) denotes the expected number. We also compute the expected number 
of visits to state i given that w^T is observed, i.e., 

    n_i = \sum_{j=1}^{N} n_{i,j}. 

These expectations can be efficiently computed by applying the well-known 
forward-backward algorithm to the initially given M_1 = (A_1, B_1). Then we update 
the parameters according to 

    a_{i,j} = \frac{n_{i,j}}{n_i},    b_{j,k} = \frac{m_{j,k}}{n_j}, 

which give the new model M = (A, B). We call M obtained in this way the one-pass 
model from M_1. Moreover, repeating the above procedure with M_1 = M 
until Pr(w^T | M) converges, we get the limit model M_B. Generally, the choice of 
the initial model M_1 affects the performance of the limit model M_B. In particular, 
the topology (the number of states and the transitions of zero probability) of 
HMMs is inherited from the initial model through the Baum-Welch updates. 



3 On-line Baum-Welch algorithm 

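In the on-line setting, a natural way to reestimate the parameters of the HMM 
would be to do exactly what the off-line Baum-Welch algorithm would do after 
seeing the last instance. This is called the incremental off-line update rule [2]. 
Namely, after seeing the t-th instance w_t, the incremental off-line algorithm computes 

    n_{i,j}^{t} = \frac{1}{t} \sum_{q=1}^{t} E(\#(i \to j) \mid w_q, M_1), 
    m_{j,k}^{t} = \frac{1}{t} \sum_{q=1}^{t} E(\#(k \mid j) \mid w_q, M_1), 
    n_{i}^{t} = \sum_{j=1}^{N} n_{i,j}^{t}, 

and then updates the parameters according to a_{i,j} = n^t_{i,j} / n^t_i and b_{j,k} = m^t_{j,k} / n^t_j. 
Note that we can recursively compute these expectations very efficiently as follows: 

    n_{i,j}^{t} = \left(1 - \frac{1}{t}\right) n_{i,j}^{t-1} + \frac{1}{t} E(\#(i \to j) \mid w_t, M_1),    (2) 

    m_{j,k}^{t} = \left(1 - \frac{1}{t}\right) m_{j,k}^{t-1} + \frac{1}{t} E(\#(k \mid j) \mid w_t, M_1).    (3) 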
This suggests that we should maintain the expectation parameters n^t_{i,j} and m^t_{j,k} 
instead of directly maintaining the model parameters a_{i,j} and b_{j,k}. The above 
update rules for the expectation parameters can be viewed as convex combinations 
of the old parameters and the expectations for the last instance w_t, 
where the expectations are taken w.r.t. the initial model M_1. Clearly, the final 
model specified by n^T_{i,j} and m^T_{j,k} after seeing all instances is the same as the 
one-pass model from M_1. Since in most cases the performance of the one-pass 
model is much worse than that of the limit model M_B, we cannot expect that 
the regret of the incremental off-line algorithm is reasonably small. It should 
be noticed that throughout all trials the initial model M_1 continues to be used 
for taking the expectations in (2) and (3), which would make the algorithm too 
conservative. 
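
That the recursions (2) and (3) reproduce the batch averages can be seen on plain numbers. In the Python sketch below each value stands for one expectation E(· | w_t, M_1) of a single parameter; the function name and the example values are ours.

```python
def incremental_average(expectations):
    """Apply n_t = (1 - 1/t) n_{t-1} + (1/t) e_t for t = 1, 2, ..."""
    n = 0.0
    for t, e in enumerate(expectations, start=1):
        n = (1 - 1 / t) * n + (1 / t) * e
    return n
```

For the values 2, 4, 9 the recursion passes through 2, 3 and 5, which is exactly the arithmetic mean of the values seen so far, so every instance keeps the same weight no matter how old it is.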

In an attempt to adjust the model to the sequence more quickly, we use the 
most recent model M_t instead of M_1 for taking the expectations. Accordingly 
we use some mixture rate η_t instead of 1/t in general. That is, we propose the 
following update rules: 

    n_{i,j}^{t} = (1 - \eta_t) n_{i,j}^{t-1} + \eta_t E(\#(i \to j) \mid w_t, M_t),    (4) 

    m_{j,k}^{t} = (1 - \eta_t) m_{j,k}^{t-1} + \eta_t E(\#(k \mid j) \mid w_t, M_t).    (5) 

The model parameters (A, B) are implicitly represented through the expectation 
parameters as follows: a_{i,j} = n_{i,j} / ∑_{s=1}^N n_{i,s} and b_{j,k} = 
m_{j,k} / ∑_{s=1}^K m_{j,s}. We call this algorithm the On-line Baum-Welch algorithm. The 
way of specifying the mixture rate η_t is called the rating scheme. For instance, 
the rating scheme for the incremental off-line algorithm is η_t = 1/t. 
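
One trial of the update rules (4) and (5) can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not a tuned implementation: states and symbols are numbered from 0, the expectations are computed by a textbook forward-backward pass without scaling (so long instances may underflow), and the function names are hypothetical.

```python
def expected_counts(A, B, w):
    """Return E(#(i -> j) | w, M) and E(#(k | j) | w, M) for M = (A, B)."""
    N, K, l = len(A), len(B[0]), len(w)
    # alpha[m][j] = Pr(w_1..w_m, s_m = j); the chain starts in state 0.
    alpha = [[1.0] + [0.0] * (N - 1)]
    for k in w:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][k]
                      for j in range(N)])
    # beta[m][i] = Pr(w_{m+1}..w_l | s_m = i)
    beta = [[1.0] * N for _ in range(l + 1)]
    for m in range(l - 1, -1, -1):
        k = w[m]
        beta[m] = [sum(A[i][j] * B[j][k] * beta[m + 1][j] for j in range(N))
                   for i in range(N)]
    prob = sum(alpha[l])
    trans = [[0.0] * N for _ in range(N)]
    emit = [[0.0] * K for _ in range(N)]
    for m in range(1, l + 1):
        k = w[m - 1]
        for i in range(N):
            for j in range(N):
                # Posterior probability of moving i -> j at step m.
                x = alpha[m - 1][i] * A[i][j] * B[j][k] * beta[m][j] / prob
                trans[i][j] += x
                emit[j][k] += x   # symbol k is output on arrival at state j
    return trans, emit

def normalize_rows(table):
    """Turn expectation parameters into probabilities (rows must be nonzero)."""
    return [[x / sum(row) for x in row] for row in table]

def obw_update(n, m, w, eta):
    """One trial: mix the old expectation parameters with those for w."""
    A, B = normalize_rows(n), normalize_rows(m)   # current model M_t
    trans, emit = expected_counts(A, B, w)
    n_new = [[(1 - eta) * old + eta * new for old, new in zip(r_old, r_new)]
             for r_old, r_new in zip(n, trans)]
    m_new = [[(1 - eta) * old + eta * new for old, new in zip(r_old, r_new)]
             for r_old, r_new in zip(m, emit)]
    return n_new, m_new
```

Keeping the expectation parameters rather than (A, B) as the state of the algorithm means that the rating scheme enters only through the single argument `eta`; with η_t = 1 a state that is never visited would get an all-zero row, which is one practical reason to keep η_t below 1.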

4 The GD algorithm 

The GD (Gradient Descent) algorithm is a general scheme for finding a maximal 
(minimal) point of a given function f. For the sake of completeness we briefly 
show how the GD algorithm works. When given an initial point x, it slightly 
changes the point to x' according to the rule x' = x + η∇f(x) for some learning 
rate η, where ∇f(x) denotes the gradient of f evaluated at x. Unfortunately, 
since the parameter space is bounded, i.e., a_{i,j}, b_{j,k} ∈ [0, 1], the update might 
move the points out of the parameter space. To make the quantities stay in 
the parameter space, Baldi and Chauvin [3] introduced a transformation that 
maps the original parameter space (A, B) to an unbounded parameter space 
(U, V). More precisely, the algorithm maintains the parameters u_{i,j} and v_{j,k} 
that represent the original parameters in terms of 

    a_{i,j} = \frac{e^{u_{i,j}}}{\sum_{j'=1}^{N} e^{u_{i,j'}}},    b_{j,k} = \frac{e^{v_{j,k}}}{\sum_{k'=1}^{K} e^{v_{j,k'}}}. 




It is clear that while the new parameter space is unbounded, the corresponding
parameters given by the above equations stay within the original parameter
space. Applying the GD algorithm in the $(U, V)$ space, we have the following
update rule for $(U, V)$. Suppose that the current parameter is $M_t = (U, V)$
and a new instance $w_t$ is given. Then the GD algorithm updates the parameters
according to

\[ u'_{i,j} = u_{i,j} + \eta \left( E(\#(i \to j) \mid w_t, M_t) - a_{i,j} \sum_{s=1}^{N} E(\#(i \to s) \mid w_t, M_t) \right), \]

\[ v'_{j,k} = v_{j,k} + \eta \left( E(\#(k \mid j) \mid w_t, M_t) - b_{j,k} \sum_{s=1}^{K} E(\#(s \mid j) \mid w_t, M_t) \right). \]

Finally it produces the new HMM $M_{t+1} = (U', V')$.
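A minimal sketch of one GD step on the transition part $U$, assuming the
normalized-exponential transformation and expected counts that are supplied
from outside (the names `softmax_rows` and `gd_step` and all numbers are
hypothetical). It shows that the updated $U$ always maps back to valid
probability rows.

```python
import math


def softmax_rows(U):
    """Map each row of unbounded parameters to a probability distribution."""
    rows = []
    for row in U:
        top = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(u - top) for u in row]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def gd_step(U, expected_counts, eta):
    """One gradient step in the unbounded space.

    expected_counts[i][j] stands for E(#(i -> j) | w_t, M_t), which a
    forward-backward pass would supply; here it is just an input.
    """
    A = softmax_rows(U)
    new_U = []
    for u_row, a_row, c_row in zip(U, A, expected_counts):
        row_total = sum(c_row)
        new_U.append([u + eta * (c - a * row_total)
                      for u, a, c in zip(u_row, a_row, c_row)])
    return new_U


U = [[0.0, 0.0], [1.0, -1.0]]
counts = [[3.0, 1.0], [0.5, 0.5]]  # hypothetical expected counts
U_next = gd_step(U, counts, eta=0.25)
A_next = softmax_rows(U_next)
```

The emission parameters $V$ would be updated by the same step with emission
counts in place of transition counts.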



5 Experiments 

We tested the performance of the on-line Baum-Welch and the GD algorithms in
several experiments.



5.1 Regret 

To see how the choice of the rating scheme $\eta_t$ influences the performance
of the on-line Baum-Welch algorithm, we examined the regret of the algorithm
with various rating schemes. The instance sequences we used are generated by
some artificially designed HMMs with $N = 3$ and $K = 2$, and all instances
$w_t = (w_{t,1}, \ldots, w_{t,l})$ are of the common length $l = 10$. In this
section we take two particular instance sequences, denoted $w_1^T$ and $w_2^T$,
respectively, which are generated by different sources. In our experiments, we
only consider polynomial rating schemes of the form $\eta_t = c(1/t)^d$ for some
non-negative constants $c$ and $d$. Recall that the regret of the On-line
Baum-Welch algorithm (OBW, for short) is defined as

\[ R(OBW, w^T) = \sum_{t=1}^{T} L(w_t, M_t) - \sum_{t=1}^{T} L(w_t, M_B), \]

where $M_t$ is the HMM that the OBW algorithm produces before seeing the
instance $w_t$ and $M_B$ is the limit model for $w^T$, starting from the same
initial model as the one used by the OBW algorithm. Although the choice of the
initial model also affects the regret, we simply chose it randomly in our
experiments. Note that the regret on the sequence $w^T$ can be viewed as a
function of $T$. We show in Fig. 1 how fast the regret grows with $T$ for
various polynomial rating schemes $\eta_t = (1/t)^d$ (we fixed $c = 1$). In
particular, the regrets shown in Fig. 1-(a) and Fig. 1-(b) are on the instance
sequences $w_1^T$ and $w_2^T$, respectively. From these experiments we can see
that even the asymptotic behavior of the regret depends both on the input
sequence and on the rating scheme. Moreover, the best choice of $d$ for the
polynomial rating scheme depends on the input sequence.

Fig. 2. The regrets of the OBW algorithm with some rating schemes
$\eta_t = c(1/t)^d$ compared with those of the GD algorithm with the optimally
tuned learning rates, $\eta = 0.3$ for $w_1^T$ and $\eta = 0.025$ for $w_2^T$.
Fig. (a) and Fig. (b) show the results for $w_1^T$ and $w_2^T$, respectively.

To compare the performance of the OBW algorithm with the GD algorithm, we
applied the GD algorithm to the same sequences $w_1^T$ and $w_2^T$. We show in
Fig. 2 the regrets $R(GD, w^T)$ of the GD algorithm as well as the regrets
$R(OBW, w^T)$ of the OBW algorithm with some rating schemes
$\eta_t = c(1/t)^d$. Again Fig. 2-(a) and Fig. 2-(b) show the regrets on $w_1^T$
and $w_2^T$, respectively. For each sequence the learning rate $\eta$ for the GD
algorithm is optimized: we chose $\eta = 0.3$ for $w_1^T$ and $\eta = 0.025$ for
$w_2^T$. That is, the optimal learning rate for the GD algorithm also strongly
depends on the input sequence. In Fig. 2 we include the regrets of the OBW
algorithm with constant rating schemes $\eta_t = c$. These constants
($c = 1/4$ for $w_1^T$ and $c = 1/60$ for $w_2^T$) are chosen to minimize the
regrets. For both sequences, we can see that the OBW algorithm with an
appropriate choice of the rating scheme performed better than the GD algorithm
with the optimal learning rate.

5.2 Adaptability 

We did an experiment to see how the on-line algorithms adapt to changes of the
signal source. For this we used three different HMMs with $N = K = 5$. From
each of them we generated a sequence of 500 instances, obtaining three
sequences. Again the length of all instances is set to $l = 10$. By
concatenating these three sequences, we finally get a single sequence $w^T$ of
length $T = 1500$. First we observe the following regret-per-trial of algorithm
$A \in \{OBW, GD\}$:

\[ R_t(A, w^T) = L(w_t, M_t) - L(w_t, M_B) \]

for each trial $t$. Here the model $M_B$ is the limit model for the whole
sequence $w^T$. Thus the regrets-per-trial sum up to the regret, i.e.,
$R(A, w^T) = \sum_{t=1}^{T} R_t(A, w^T)$.

In Fig. 3 we give the regrets-per-trial $R_t(OBW, w^T)$ and $R_t(GD, w^T)$. We
tuned the rates for both algorithms appropriately and set $\eta_t = 0.03$ for
the OBW algorithm and $\eta = 0.25$ for the GD algorithm. It is interesting to
note that a diminishing rating scheme $\eta_t$ with
$\lim_{t \to \infty} \eta_t = 0$ no longer works for the OBW algorithm, because
when the source is changing the algorithm has to “forget” the far past data to
follow the recent change. Intuitively, a constant rate makes the algorithm
forget the past data exponentially fast.
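The forgetting behaviour can be made concrete by unrolling the convex-combination
update. The sketch below is a simplification: it treats the per-trial
expectations as fixed numbers, although in the OBW algorithm they depend on the
current model, and the helper name is hypothetical. It computes how much
weight each past trial still carries after the last update.

```python
def effective_weights(rates):
    """Weight that each past expectation carries after the last update.

    rates[t-1] is the mixture rate used at trial t. Unrolling
    x_t = (1 - rate_t) * x_{t-1} + rate_t * e_t gives the weight of e_s as
    rate_s * prod_{r > s} (1 - rate_r).
    """
    weights = []
    for s, rate in enumerate(rates):
        w = rate
        for later in rates[s + 1:]:
            w *= 1.0 - later
        weights.append(w)
    return weights


T = 5
averaging = effective_weights([1.0 / t for t in range(1, T + 1)])
constant = effective_weights([0.5] * T)
```

With $\eta_t = 1/t$ every trial keeps the same weight $1/T$, so old data is
never discounted; with a constant rate the weight halves (for $\eta_t = 0.5$)
with each step into the past.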

Surprisingly, we can see that the regrets $R(OBW, w^T)$ and $R(GD, w^T)$ are
negative. That is, the on-line algorithms beat the off-line Baum-Welch
algorithm. Moreover, although the on-line algorithms perform badly at the moment
that the source is changed ($t = 500$ and $t = 1000$), they rapidly adapt to
the new source.

Next we compare the performance of the on-line algorithms with switching
HMMs, rather than with the single model $M_B$. More precisely, for each interval
$T_i = \{500(i-1)+1, \ldots, 500i\}$ ($i \in \{1, 2, 3\}$), let $M_B^i$ be the
limit model for the instances $\{w_t \mid t \in T_i\}$ in the $i$th interval. In
other words, $M_B^i$ models the source of the $i$th interval. Now we consider
the following adaptive regret-per-trial of algorithm $A$:

\[ AR_t(A, w^T) = L(w_t, M_t) - L(w_t, M_B^i), \]

where $t \in T_i$, and define the adaptive regret as

\[ AR(A, w^T) = \sum_{t=1}^{T} AR_t(A, w^T). \]

Fig. 3. The regret-per-trial for a sequence generated by changing sources. The
source is changed at every 500 trials. The rating scheme and the learning rate
for the OBW and the GD algorithms are tuned appropriately: $\eta_t = 0.03$ for
OBW and $\eta = 0.25$ for GD.

Fig. 4. The adaptive regret-per-trial. The setting of the experiment is the same
as in Fig. 3.

Fig. 4 shows the adaptive regret-per-trial of the OBW and the GD algorithms. 
From this we can see that the on-line algorithms adapt the HMM to the source 
of each interval without knowing when the source is changed. 
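Given per-trial losses, the two quantities compared in this section reduce to
simple sums. The sketch below assumes the losses have already been computed;
the function names and all numbers are hypothetical and are not taken from the
experiments.

```python
def regret(alg_losses, ref_losses):
    """Sum over trials of L(w_t, M_t) - L(w_t, M_ref)."""
    return sum(a - r for a, r in zip(alg_losses, ref_losses))


def adaptive_regret(alg_losses, interval_losses, interval_len):
    """Regret against the limit model of the interval each trial falls in.

    interval_losses[i][t] is the loss of the i-th interval's model on trial t
    (0-based trial index).
    """
    total = 0.0
    for t, loss in enumerate(alg_losses):
        i = t // interval_len
        total += loss - interval_losses[i][t]
    return total


alg = [2.0, 1.0, 3.0, 1.0]              # hypothetical L(w_t, M_t)
single = [1.5, 1.5, 1.5, 1.5]           # hypothetical L(w_t, M_B)
per_interval = [[1.0, 1.0, 4.0, 4.0],   # model fitted to trials 0-1
                [4.0, 4.0, 1.0, 1.0]]   # model fitted to trials 2-3
r = regret(alg, single)
ar = adaptive_regret(alg, per_interval, interval_len=2)
```

In this toy case the adaptive regret is larger than the regret because the
per-interval models fit their own intervals better than the single model does.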

5.3 Experiments on speech data 






Fig. 5. Left-Right model 



We did the same experiments on natural speech data. The data consists of
the acoustic waveforms pronounced by 10 speakers. There are 503 sentences
per speaker. The acoustic waveforms were transformed to symbol sequences by
the vector quantization (VQ) method with 1024 symbols, and then segmented
into subsequences that correspond to phonemes. We then extracted from them
the subsequences that correspond to vowels. Namely, for each speaker $i$ and
each vowel $p \in \{/a/, /o/, /u/, /i/, /e/\}$, we made a set of sequences
$W_{i,p}$. A member of $W_{i,p}$ is a symbol sequence (an instance) of vowel $p$
pronounced by speaker $i$. We then concatenated the 10 sets of all speakers for
a particular vowel $p$ to make an instance sequence $w^T$. That is, $w^T$ is a
sequence of instances in $W_p = \bigcup_{i=1}^{10} W_{i,p}$. For example, for
$p = /i/$, $W_{i,p}$ contains 1818 instances of $p$ for each speaker $i$. So
$w^T$ is a sequence of length $T = 18180$ and the speaker changes at every 1818
trials in $w^T$. The length of instances ranges from 3 to 15. For this sequence
we tested the performance of the OBW and the GD algorithms. Here we restrict
the topology of the HMM to the left-right model with three states (see Fig. 5),
which has been successfully utilized in speech recognition to model phonemes.
In Fig. 6 we show the adaptive regret-per-trial for the instance sequence
from $W_{/i/}$. The algorithms seem to adapt to the change of speakers very
well. The regret $R(A, w^T)$ and the adaptive regret $AR(A, w^T)$ for
$A \in \{OBW, GD\}$ are shown below. The OBW algorithm seems to perform slightly
better than the GD algorithm.



A      R(A, w^T)    AR(A, w^T)
OBW    -117630      65397
GD     -108631      74396

Fig. 6. The adaptive regret-per-trial for $W_{/i/}$. The speaker changes at
every 1818 trials. The rating scheme for the OBW algorithm is $\eta_t = 0.003$
and the learning rate for the GD algorithm is $\eta = 0.25$.



6 Concluding remarks 

We proposed an on-line variation of the Baum-Welch algorithm (the OBW
algorithm) and evaluated the performance of the algorithm in terms of the
regret, which measures the relative performance compared to the off-line
Baum-Welch algorithm. As far as we know, this is the first attempt at such a
competitive analysis for parameter estimation of HMMs. Throughout the
experiments, the OBW algorithm seems to perform marginally better than the GD
algorithm. In particular, the algorithm could be applied to speaker adaptation.

However, it is not clear how the rating scheme $\eta_t$ should be chosen.
Intuitively, while the instances come from a single source, the rate should
decrease with time. But it should recover quickly when the source changes.
So we have to develop a rating scheme that depends on the past sequence.

Another method to make the algorithm adaptive would be to make use of the
idea behind the variable share algorithm, which was developed for on-line linear
regression where the best regressor changes with time [5]. At the end of each
trial the algorithm shares a fraction of the parameters with the others
according to the likelihood for the last instance, so that if the likelihood
becomes small (which is interpreted as a change of the source), parameters
whose values have become too small quickly recover.

Theoretically it is interesting to investigate the worst-case regret of the
algorithms. We would like to have $o(T)$ regret for any input sequence. But
Fig. 1 suggests that the regret of both the OBW and the GD algorithms grows
linearly in $T$. The performance could be improved by using a more sophisticated
rating scheme and/or the sharing method.
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Abstract. In this paper, we address computational complexity issues of 
decompositional approaches to if-then rule extraction from feed-forward 
neural networks. We also introduce a computationally efficient technique 
based on ordered-attributes. It reduces search space significantly and 
finds valid and general rules for single nodes in the networks. Empirical 
results are shown. 



1 Introduction 

How to extract if-then rules from a trained feed-forward neural network has
been investigated by several researchers [1, 2, 3, 4, 5, 7, 8]. However, the
rule extraction processes are computationally expensive since the rule search
space grows exponentially with the number of input attributes (i.e., the input
domain). Several techniques have been introduced to reduce the search
complexity of decompositional approaches, such as the KT heuristics and
complexity control parameters [3, 4, 5] described in Sect. 2. Other techniques
include the soft-weight-sharing used in the MofN algorithm [9], which clusters
the weights into equivalence classes. This approach, however, requires a
special training technique and does not describe the network with sufficient
accuracy.

This paper summarizes the existing heuristics used to reduce the rule search
space and proposes a computationally efficient technique, the ordered-attribute
search. The proposed algorithm extracts valid and general rules from single
nodes in $O(n \log n)$ for most cases. Empirical results are also provided.

2 Complexity Issues in Rule Extraction 

2.1 If-Then Rules 

A rule has the form “if the premise, then the conclusion.” The premise is
composed of a number of positive attributes (e.g., $x_i$) and negative
attributes (e.g., not $x_i$), and so is the conclusion (e.g., $c_i$ and not
$c_i$). A rule is called a confirming rule if the conclusion is $c_i$, or a
disconfirming rule if the conclusion is not $c_i$. In the basic form of a rule,
the rule’s premise is limited to a conjunction of attributes and the rule’s
conclusion is limited to a single attribute (i.e., a single class). However,
the presence of multiple rules with the same conclusion represents disjunction.
A rule with a conjunction of conclusions is represented by multiple rules with
the same premise but different conclusions.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 170-181, 2000.
© Springer-Verlag Berlin Heidelberg 2000

Computationally Efficient Heuristics for If-Then Rule Extraction

The quality of a rule is evaluated with a few criteria. First of all, a rule
should be valid. Validity is defined as follows: whenever the rule’s premise
holds, so does its conclusion, in the presence of any combination of the values
of attributes not referenced by the rule [3]. Other criteria include accuracy
(specificity or goodness of fit) and generality (simplicity). Accuracy is about
how often the rule classifies correctly. Generality is about how often the
left-hand side of a rule occurs. It is also related to the coverage of the
rule: as the rule gets simpler (i.e., shorter), it covers more instances in the
input domain.

2.2 Decompositional Approach 

Decompositional approaches to rule extraction from a trained neural network
(i.e., a feed-forward multi-layered neural network) involve the following
phases:

1. Intermediate rules are extracted at the level of individual units within the
network.

2. The intermediate rules from each unit are aggregated to form the composite
rule base for the neural network. This phase rewrites rules to eliminate the
symbols which refer to hidden units but are not predefined in the domain. In
the process, redundancies, subsumptions, and inconsistencies are removed.

At each non-input unit of a network, $n$ incoming connection weights and a
threshold are given. Rule extraction at the unit finds a set of incoming
attribute combinations that are valid and maximally general. Fig. 1 gives an
example: a combination $(x_1, x_2)$ is a valid rule because its weight-sum
(i.e., $3 + 4 = 7$) is always greater than the threshold (i.e., $-1$)
regardless of the values of the other incoming units. Another example is
(not $x_3$), which is also valid.




Fig. 1. An individual node with five incoming connection weights and a
threshold. Input values for input nodes are binary: 1 or 0.

2.3 Computational Complexity

Decompositional algorithms extract intermediate rules from each non-input node
and then rewrite them to form the composite rule base for the network. In this
paper, we focus on the computational complexity of rule extraction from a
single node with $n$ incoming binary attributes. Given the $n$ attributes,
there are $2^n$ (i.e., $\sum_{l=0}^{n} C_l^n$) possible combinations of the
attributes and $3^n$ (i.e., $\sum_{l=0}^{n} 2^l C_l^n$) possible rules in the
rule space.

For a combination with $l$ attributes (i.e., $0 \le l \le n$), each attribute
could be encoded positive or negative. Therefore, there are $2^l$ possible rule
representations for the combination. For each combination of size $l$, we
discard $2^l - 1$ representations from the rule search space instantly by the
following two steps:

1. For each attribute, the attribute is encoded positive if its corresponding
weight is positive; it is encoded negative otherwise. For example, the
attributes $(x_1, x_3, x_5)$ in Fig. 1 are represented as “$x_1$ and not $x_3$
and $x_5$” for their candidate rule. The lowest total-input for a candidate
rule $R$ is defined as follows:

\[ \text{total-input}_{lowest}(R) = \sum_{i=1}^{n} w_i x_i, \quad \begin{cases} w_i x_i = |w_i| & \text{if } x_i \in P \\ x_i = 1 & \text{if } w_i < 0 \text{ and } x_i \notin P \\ x_i = 0 & \text{if } w_i > 0 \text{ and } x_i \notin P \end{cases} \tag{1} \]

where $P$ is the set of attributes in the premise of $R$.

2. Validity is tested for the candidate rule. By the definition of validity, the
candidate rule is valid if its lowest total-input is greater than the
threshold.

Validity testing costs $O(1)$ for each combination and thus this method reduces
the rule search space from $3^n$ to $2^n$.
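The two steps can be sketched as follows. This is one reading, not the
author's code: it assumes 0/1 inputs, so a positive premise attribute
contributes its weight, a negated premise attribute contributes 0, and every
unreferenced attribute takes its worst-case value. The weights are hypothetical
(the actual values of Fig. 1 are not reproduced here) and attributes are
indexed from 0.

```python
from math import comb


def lowest_total_input(weights, premise):
    """Worst-case net input when the attributes in `premise` are fixed.

    Assumption: the sign encoding fixes x_i = 1 for a positive weight and
    x_i = 0 (i.e. "not x_i") for a negative weight.
    """
    total = 0.0
    for i, w in enumerate(weights):
        if i in premise:
            total += w if w > 0 else 0.0   # favourable fixed value
        else:
            total += w if w < 0 else 0.0   # adversarial free value
    return total


def is_valid(weights, threshold, premise):
    return lowest_total_input(weights, premise) > threshold


weights = [3.0, 4.0, -6.0, 2.0, 1.0]   # hypothetical weights for x1..x5
threshold = -1.0
n = len(weights)
n_combinations = sum(comb(n, l) for l in range(n + 1))      # 2^n
n_rules = sum(2 ** l * comb(n, l) for l in range(n + 1))    # 3^n
```

With these weights `is_valid(weights, threshold, {0, 1})` and
`is_valid(weights, threshold, {2})` both hold, mirroring the two examples
given for Fig. 1, while the empty premise is not valid.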

Obtaining all possible combinations of rules is NP-hard, and a feasible
alternative is often to extract key rules that cover most of the domain
knowledge. In this paper, we want to find the $b$ best rules, that is, the
simplest valid rule combinations among all $2^n$ possible combinations of
incoming weight connections.



2.4 Tree-Based Search 

Tree-based search uses a tree structure where each node is one of the $2^n$
combinations. At depth $l$ in the search tree, the length of the combinations
is $l$. The tree-based search proceeds from the root node (i.e., depth 0 or the
shortest combination) toward the leaf nodes (i.e., depth $n$ or the longest
combination), checking the validity of each node. If it finds a valid rule
combination, it stops generating children nodes and takes it as a rule, using
the following property, which reduces the search space: “if a combination of a
node is a valid rule, combinations of its descendant nodes are eliminated from
the search tree space.” For $b = 1$, the worst case $O(2^n)$ occurs when the
rule is a leaf node combination of length $n$. The worst case for any $b > 1$
occurs at
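$C_{l+1}^n < b \le C_l^n$ where $\lceil n/2 \rceil \le l \le n$. The worst-case
search cost for any $b$ is:

\[ O\Big(\sum_{l=0}^{\lceil n/2 \rceil} C_l^n\Big) \le O(\text{Tree-based search}) \le O\Big(\sum_{l=0}^{n} C_l^n\Big) = O(2^n) \tag{2} \]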
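A level-by-level version of this search can be sketched as below. It is an
illustrative assumption about the traversal order, not the exact procedure of
any cited algorithm: combinations are visited by increasing length, supersets
of an already accepted rule are skipped, and the search stops after `b` rules.
The weights and the helper names are hypothetical.

```python
from itertools import combinations


def lowest_total_input(weights, premise):
    # Premise attributes take their favourable value, all others the worst one.
    return sum((w if w > 0 else 0.0) if i in premise else (w if w < 0 else 0.0)
               for i, w in enumerate(weights))


def tree_search(weights, threshold, b):
    """Return up to b shortest valid premises and the number of validity tests."""
    n = len(weights)
    rules, tested = [], 0
    for length in range(n + 1):
        for combo in combinations(range(n), length):
            premise = set(combo)
            if any(rule <= premise for rule in rules):
                continue  # descendant of a valid rule: pruned
            tested += 1
            if lowest_total_input(weights, premise) > threshold:
                rules.append(premise)
                if len(rules) == b:
                    return rules, tested
    return rules, tested


weights = [3.0, 4.0, -6.0, 2.0, 1.0]   # hypothetical weights for x1..x5
first_rule, tests_for_one = tree_search(weights, -1.0, b=1)
three_rules, _ = tree_search(weights, -1.0, b=3)
```

Even with pruning, every combination shorter than the first valid one is
tested, which is where the exponential worst case comes from.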



KT Heuristics. Fu’s KT algorithm [2, 3, 4] is an improved tree search algorithm
which reduces the search space and memory space heuristically. The KT algorithm
reduces the search space by dividing the $n$ incoming attributes into $p$
positive and $q$ negative attributes (s.t. $p + q = n$). The basic idea of the
separation derives from the fact that the complexity of an exponential space is
greater than the sum of the complexities of its sub-spaces. That is,
$x^l > (a \cdot x)^l + (b \cdot x)^l$ where $l > 1$, $a < 1$, $b < 1$ and
$a + b = 1$. For each combination in the positive attribute sub-space, its
validity is tested. If the positive attribute combination is not valid, the
negative attribute sub-space is searched to form a valid combination composed
of both positive and negative attributes. Thus the search space size for the
positive attribute sub-space is $\sum_{l=0}^{p} C_l^p = 2^p$. Even though this
approach reduces the complexity in most cases, the worst-case complexity
remains the same if $p = n$ (i.e., no negative attributes). If every positive
attribute combination needs a negative attribute search, the search space is
also $2^p \cdot 2^q = 2^n$.



Complexity Control Parameters. In addition to the heuristics, some parameters
are introduced to control the complexity of the rule search procedure [5].
Some of the parameters include:

— Representative attributes: When there are too many incoming connections to
concept nodes, only highly weighted connections are considered. This makes
sense because input summation seems to be determined by a few highly
weighted (highly relevant) connections, not by all the connections. Thus, we
can set a threshold on the weight or pick a certain percentage of attributes
with high pertinence. The attributes selected are called representative
attributes.

— Certainty factor threshold ($\theta_{CF}$): The certainty factor (CF) of a
rule is defined as the sigmoid transformation of the value obtained by
subtracting the threshold from the weight sum of all the attributes in the
rule’s premise. If the CF is in [0.5, 1.0], the rule is a confirming rule. If
the CF is in [0.0, 0.5], the rule is a disconfirming rule and its CF is
converted into [−1.0, −0.5]. We may discard rules with a CF lower than a
predefined threshold.

— Rule simplification and merging similar rules: The generated rules are refined 
by simplifying them using symbolic logic. 
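The certainty factor described in the list above can be sketched as follows.
Two points are assumptions of this sketch rather than statements of the source:
the premise is given as the indices of the attributes whose inputs are set to
1, and the conversion of a disconfirming CF into [−1.0, −0.5] is taken to be a
shift by −1, since the exact mapping is not spelled out.

```python
import math


def certainty_factor(weights, threshold, premise):
    """Sigmoid of (weight sum of the premise attributes - threshold)."""
    weight_sum = sum(weights[i] for i in premise)
    return 1.0 / (1.0 + math.exp(-(weight_sum - threshold)))


def signed_cf(cf):
    """Keep confirming CFs; shift disconfirming ones into [-1.0, -0.5].

    Assumption: the shift cf - 1.0 is one mapping with the stated range.
    """
    return cf if cf >= 0.5 else cf - 1.0


cf = certainty_factor([3.0, 4.0], threshold=7.0, premise=[0, 1])
converted = signed_cf(0.25)
```

A rule whose absolute CF falls below the chosen threshold $\theta_{CF}$ would
then be discarded.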

The parameters control the trade-off between the quality of rules and the
search complexity. However, finding optimal values for the parameters is mostly
accomplished by trial and error and might add further computational cost to the
process.

3 Ordered-Attribute Search

Gallant [6] describes a procedure to find a single rule to explain the
conclusion reached by the neural network given a case. His method involves the
ordering of attributes based on absolute weight values. However, he argues that
the procedure works only if the network is very small; the number of implicitly
encoded if-then rules can grow exponentially with the number of incoming units.
In this section, we introduce a different search algorithm called the
ordered-attribute search (OAS) algorithm. The OAS algorithm involves the
following three procedures:

1. Attribute Contribution Scoring: For each attribute, its contribution to a 
candidate rule is calculated. 

2. Attribute Sorting: Attributes are sorted in descending order according to 
their contribution scores. 

3. Rule Searching: With the attributes sorted by their contribution scores, valid 
and maximally general rules are to be searched. 

Consider a rule search tree. The root node (i.e., length 0) represents a
default rule candidate $R_{default}$. Adding attributes to $R_{default}$
increases (or decreases) its lowest total-input by the amount of the weight
values of the attributes. For binary attribute cases, the contribution score
$C(x_i)$ of an attribute $x_i$ is defined as follows: $C(x_i) = |w_i|$.
Therefore, adding a positive (or negative) attribute $x_i$ increases the lowest
total-input of $R_{default}$ by $C(x_i)$. The $csum(R)$ is defined as the sum of
the contribution scores of the attributes in the premise of the candidate rule
$R$. Then the lowest total-input for $R$ is defined as follows:

\[ \text{total-input}_{lowest}(R) = csum(R) + \text{total-input}_{lowest}(R_{default}) \tag{3} \]

Note that $\text{total-input}_{lowest}(R)$ grows with $csum(R)$, differing from
it only by a constant, since the lowest total-input of $R_{default}$ is
constant.
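Equation (3) translates directly into code. In the sketch below the weights
are hypothetical, attributes are indexed from 0, and the default rule's lowest
total-input is taken to be the sum of the negative weights (every unreferenced
attribute at its worst-case 0/1 value), which is an assumption of this
illustration.

```python
def contribution_scores(weights):
    """C(x_i) = |w_i| for binary attributes."""
    return [abs(w) for w in weights]


def csum(weights, premise):
    """Sum of the contribution scores of the attributes in the premise."""
    return sum(abs(weights[i]) for i in premise)


def default_lowest(weights):
    """Lowest total-input of the default rule (empty premise)."""
    return sum(w for w in weights if w < 0)


def lowest_total_input(weights, premise):
    """Equation (3): csum(R) plus the constant of the default rule."""
    return csum(weights, premise) + default_lowest(weights)


weights = [3.0, 4.0, -6.0, 2.0, 1.0]   # hypothetical weights for x1..x5
premise = {0, 2}                        # "x1 and not x3"
value = lowest_total_input(weights, premise)
```

Because the second term never changes, ranking candidate rules by `csum` is the
same as ranking them by their lowest total-input.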

After the contribution scores are calculated, attributes are sorted in
descending order by their contribution scores as follows:
$a_1, a_2, a_3, \ldots, a_n$. Then, we use the generic algorithm that lists the
combinations of length $l$ in the following order:

\[ (a_1, a_2, \ldots, a_{l-1}, a_l), \ldots, (a_1, a_2, \ldots, a_{l-1}, a_n), \ldots, (a_1, a_3, \ldots, a_{l+1}), \ldots, (a_{n-l+1}, \ldots, a_n) \tag{4} \]

At each depth $l$ (i.e., $1 \le l \le n$) in an OAS search tree, the number of
combinations is $C_l^n$ and they are listed from left to right as defined in
listing (4).

Corollary 1. In the list of combinations of length $l$, the first combination
(i.e., the left-most one) holds the highest csum. The last one in the list
holds the lowest csum.
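The ordering in listing (4) is the lexicographic order over sorted positions,
which Python's `itertools.combinations` produces when given the positions in
ascending order. The sketch below, with hypothetical weights, lists the
length-2 combinations and checks Corollary 1 on them.

```python
from itertools import combinations

weights = [3.0, 4.0, -6.0, 2.0, 1.0]                    # hypothetical weights
scores = sorted((abs(w) for w in weights), reverse=True)  # a_1 >= a_2 >= ...

length = 2
# Positions 0..n-1 refer to the sorted attributes a_1..a_n.
listing = list(combinations(range(len(scores)), length))
csums = [sum(scores[i] for i in combo) for combo in listing]
```

The first entry of `listing` pairs the two largest scores and the last entry
pairs the two smallest, so only the ends of the list need to be inspected to
bound the csum at a given depth.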

Due to Corollary 1, it is straightforward to find the valid combination of
shortest length. Starting from depth 1 of an OAS search tree, only the first
node (i.e., combination) at each depth is tested for validity. If it is not
valid, the other nodes at that depth are not valid either (by Corollary 1). The
worst-case complexity of this process is $O(n)$. Suppose that the first valid
combination is found at depth $d$ after $d$ tests. Then the following lemma is
true.

Lemma 2. All the combination nodes from depth 0 to depth $(d-1)$ are not valid.
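The linear scan for the shortest valid rule can be sketched as follows,
assuming the attributes have been sorted by contribution score and that the
lowest total-input is csum plus the default rule's constant, as in Eq. (3).
The function name and the weights are hypothetical.

```python
def shortest_valid_length(weights, threshold):
    """Depth of the first valid left-most combination, or None if none exists."""
    scores = sorted((abs(w) for w in weights), reverse=True)
    base = sum(w for w in weights if w < 0)  # lowest total-input of default rule
    running = 0.0
    for depth, score in enumerate(scores, start=1):
        running += score                     # csum of the top-`depth` attributes
        if base + running > threshold:
            return depth
    return None


weights = [3.0, 4.0, -6.0, 2.0, 1.0]   # hypothetical weights for x1..x5
d_easy = shortest_valid_length(weights, threshold=-1.0)
d_hard = shortest_valid_length(weights, threshold=5.0)
d_none = shortest_valid_length(weights, threshold=100.0)
```

Each depth costs one addition and one comparison, so after sorting the scan
needs at most $n$ tests.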



Proposition 3. By the definition of csum and the combination listing in
listing (4), the following inequalities are true:

1. $csum(a_{s_1}, a_{s_2}, \ldots, a_{s_l}) \ge csum(a_{s_1}, \ldots, a_{s_{l-1}}, a_{i_l > s_l})$

2. $csum(a_{s_1}, \ldots, a_{s_j}, \ldots, a_{s_l}) \ge csum(a_{s_1}, \ldots, a_{i_j > s_j}, a_{i_{j+1} \ge \max(s_{j+1}, i_j + 1)}, \ldots, a_{i_l \ge \max(s_l, i_{l-1} + 1)})$

3. $csum(a_{s_1}, a_{s_2}, \ldots, a_{s_l}) \ge csum(a_{i_1 > s_1}, a_{i_2 \ge \max(s_2, i_1 + 1)}, \ldots, a_{i_l \ge \max(s_l, i_{l-1} + 1)})$



The V set of a combination $R$ is defined as the set of all combinations with
smaller csums, as defined by the three inequalities in Proposition 3. Then the
following is true.

Lemma 4. Given a combination at depth d, if the combination is not valid, 
combinations of its V set are not valid either. 

The three inequalities in Proposition 3 reduce the search space significantly.
At depth $l$, validity is tested from the first combination in the list; a
valid one is stored in a rule candidate set and then the next one in the list
is tested; if it is not valid, its V set is eliminated from the search space
instantly. For example, let the smallest size $l$ be 3 and the number of
attributes $n$ be 10. Then the number of possible combinations is
$C_3^{10} = 120$. Consider a combination $R = (a_1, a_3, a_6)$. By inequality 1
in Proposition 3, the lowest total-input of $R$ is greater than those of
$(a_1, a_3, a_7), \ldots, (a_1, a_3, a_{10})$. Thus, if $R$ is not valid,
$\sum_{k=6+1}^{10} 1 = 4$ combinations are eliminated from the search space. By
inequality 2, the csum of $R$ is greater than those of
$(a_1, a_{j>3}, a_{k \ge \max(6, j+1)})$. It eliminates
$\sum_{j=4}^{9} \sum_{k=\max(6, j+1)}^{10} 1 = 20$ combinations. By inequality
3, the csum of $R$ is greater than those of
$(a_{i>1}, a_{j \ge \max(3, i+1)}, a_{k \ge \max(6, j+1)})$. It eliminates
$\sum_{i=2}^{8} \sum_{j=\max(3, i+1)}^{9} \sum_{k=\max(6, j+1)}^{10} 1 = 80$
combinations. Altogether, $csum(R)$ is greater than the csums of 104 other
combinations, and so is the lowest total-input of $R$. Fig. 2 illustrates the
number of combinations that can be eliminated when the current combination
(i.e., combination id) in the list is not valid.
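The counts in this example can be checked by brute force. The sketch below
treats the V set of $R = (a_1, a_3, a_6)$ as every other length-3 combination
whose sorted positions are position-wise no smaller than those of $R$; since
the attributes are sorted by descending score, such a combination cannot have
a larger csum. The function name is hypothetical.

```python
from itertools import combinations
from math import comb


def dominated(reference, n):
    """Same-length combinations that are position-wise >= reference, minus itself."""
    return [c for c in combinations(range(1, n + 1), len(reference))
            if c != reference and all(ci >= ri for ci, ri in zip(c, reference))]


v_set = dominated((1, 3, 6), 10)
by_ineq1 = [c for c in v_set if c[0] == 1 and c[1] == 3]   # only the last moves
by_ineq2 = [c for c in v_set if c[0] == 1 and c[1] > 3]    # second position moves
by_ineq3 = [c for c in v_set if c[0] > 1]                  # first position moves
total = comb(10, 3)
```

The three groups are disjoint and together cover the whole V set, which
matches the 4 + 20 + 80 = 104 combinations counted in the text out of 120.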

Fig. 2. Number of combinations to be eliminated from search space when current
combination is not valid.

Another heuristic in the OAS comes from the property of the tree structure: if
a combination at depth $l$ is valid, its child combinations at depth $l + 1$
are also valid. Therefore the combinations subsumed by a valid rule (i.e., a
parent node) can be discarded from the search space. With the OAS heuristics,
the following types of combinations are eliminated from the search tree:

1. invalid combinations (Lemma 2 and 4) and

2. valid combinations which are subsumed by a valid combination.

All other combinations in the tree are subject to a validity test for rule
candidacy.

In the problem of rule extraction from a neuron unit with $n$ incoming binary
attributes, we want to find the $b$ rules which are valid and maximally general
(i.e., the rules having shorter length than others). Even though there are
$2^n$ possible rules, at most $C_{\lceil n/2 \rceil}^n$ unique rules are
possible because longer rules are subsumed by a shorter rule. Hence the
following is true:

The contribution scoring procedure costs $O(n)$. The sorting phase costs
$O(n \log n)$ with well-known sorting algorithms such as heap sort. Once the
attributes are sorted, searching for the best rule costs $O(l)$ where
$1 \le l \le n$; here $l$ is the smallest length of the valid rules to be
searched. When $b$ is greater than 1, the search starts from the length $l$
combinations, eliminating combinations from the search space.

Even when searching the $C_l^n$ combinations, it eliminates a large part of the
search space by Proposition 3. Thus the complexity of the OAS is:

\[ O(n \log n) \le O(\text{OAS}) \le O(n \log n) + O(C_l^n). \tag{5} \]

In summary, the worst-case complexities of the three search algorithms for any
$b$ are

\[ O(\text{Exhaustive search}) = O(2^n) \]
\[ O(2^{n-1}) \le O(\text{Tree-based search}) \le O(2^n) \tag{6} \]
\[ O(n \log n) \le O(\text{Ordered-attribute search}) \le O(2^{n-1}). \]

4 Empirical Results

The OAS algorithm is applied to public domains: XOR, Iris and Hypothyroid. For
the XOR problem, we show the intermediate rules extracted from each non-input
unit and the final rules rewritten with the intermediate ones. An example of
the rewriting procedure is also provided. For Iris and Hypothyroid, we show
the performance of individual rules and rulesets generated by the OAS. The
performance of an individual rule is evaluated by coverage and accuracy defined
by: coverage = (pos + neg)/total and rule accuracy = pos/(pos + neg), where
pos is the number of positive instances, neg is the number of negative
instances, and total is the number of instances in the test set. Note that
(pos + neg) is the number of instances covered by the rule, and thus
total = pos + neg + notcovered, where notcovered is the number of instances not
covered by the rule. The performance of a ruleset is evaluated by the
following: coverage = (pos + neg)/total, confidence = pos/(pos + neg) and
classification accuracy = pos/total. The accuracy of a ruleset is defined
differently from that of a single rule: it is the performance of a ruleset
classifier over the complete domain.
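These measures are simple ratios of counts. The sketch below is only an
illustration with hypothetical counts and function names; it is not tied to
the figures reported in the tables.

```python
def rule_metrics(pos, neg, total):
    """Coverage and accuracy of a single rule."""
    covered = pos + neg
    return {
        "coverage": covered / total,
        "accuracy": pos / covered if covered else 0.0,
    }


def ruleset_metrics(pos, neg, total):
    """Coverage, confidence and classification accuracy of a ruleset."""
    covered = pos + neg
    return {
        "coverage": covered / total,
        "confidence": pos / covered if covered else 0.0,
        "classification_accuracy": pos / total,
    }


single = rule_metrics(pos=18, neg=2, total=100)      # hypothetical counts
whole = ruleset_metrics(pos=90, neg=5, total=100)    # hypothetical counts
```

For a ruleset, confidence ignores the uncovered instances while classification
accuracy counts them as errors, which is why the two values differ whenever
coverage is below 100%.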

4.1 XOR 

A fully-connected network is configured with 2 binary input attributes (i.e.,
x0 and x1), 4 hidden units (i.e., h0, h1, h2 and h3) and 1 output unit (i.e.,
y). Intermediate rules extracted from the non-input nodes and the final rules
are listed in Table 1. The rewriting procedure aggregates intermediate rules by
eliminating the hidden unit symbols which are not predefined in the domain.
Rewriting starts with the output layer and rewrites the rules of one layer at a
time in terms of the rules of the next layer closer to the input of the
network. For example, we rewrite the rule R12 from output unit y, “if not h0
and not h2 then y −0.966”. Since “not h0” and “not h2” are not defined in the
domain, they are replaced by R1 and R6, respectively. After simple logical
refinement, the final rule FR12 is obtained. Note that R14 is rewritten into
two final rules FR14-1 and FR14-2 while R15 does not form any valid final
rules. The OAS generates a complete set of final rules for the XOR problem.

4.2 Iris Domain 

Fisher’s iris domain involves 4 continuous-valued features and 3 concepts.
Each continuous-valued feature is discretized into three interval attributes,
resulting in a total of 12 binary input attributes. A three-layered neural
network (12-4-3) is configured. Two experiments are performed: (1)
comprehensive capability is evaluated by looking at how well training data
instances are translated into a small set of concise rules, and (2) prediction
capability is evaluated by looking at how well the extracted ruleset classifies
unseen data instances, compared with the network.
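The discretization into 12 binary inputs can be sketched as a three-way
interval encoding per feature. The cut points below are read off the interval
bounds that appear in the rules of Table 2; how values exactly on a boundary
are assigned, the feature order, and the function name are assumptions of this
sketch.

```python
# Cut points per feature, taken from the interval bounds used in Table 2.
CUTS = {
    "petal-length": (2.7, 5.0),
    "petal-width": (0.7, 1.6),
    "sepal-length": (5.4, 6.3),
    "sepal-width": (2.8, 3.1),
}
FEATURES = ["petal-length", "petal-width", "sepal-length", "sepal-width"]


def encode(sample):
    """Turn 4 continuous features into 12 binary interval attributes."""
    bits = []
    for name in FEATURES:
        low, high = CUTS[name]
        value = sample[name]
        # Assumption: intervals are [.., low), [low, high), [high, ..).
        bits += [int(value < low), int(low <= value < high), int(value >= high)]
    return bits


flower = {"sepal-length": 5.1, "sepal-width": 3.5,
          "petal-length": 1.4, "petal-width": 0.2}
encoded = encode(flower)
```

Exactly one of the three bits is set for each feature, so every encoded
instance has four ones among its 12 inputs.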

A network is trained with a data set of 150 instances (50 instances for each
concept) and its accuracy is 98.67%, with only two instances misclassified. The
OAS (Ordered-Attribute Search) algorithm is applied to the network and nine
rules are obtained (Table 2). The performance of the ruleset and of each
individual rule is evaluated over the 150 instances and listed in Table 3.

Intermediate rules
from h0   R1.      x0 and x1               then   h0 -0.827
          R2.      not x0                  then   h0  0.785
          R3.      not x1                  then   h0  0.780
from h1   R4.      x0 and x1               then   h1 -0.787
          R5.      not x0 and not x1       then   h1  0.778
from h2   R6.      x0 and x1               then   h2 -0.847
          R7.      not x0                  then   h2  0.819
          R8.      not x1                  then   h2  0.805
from h3   R9.      x0                      then   h3 -0.980
          R10.     x1                      then   h3 -0.980
          R11.     not x0 and not x1       then   h3  0.937
from y    R12.     not h0 and not h2       then   y  -0.966
          R13.     h3                      then   y  -0.974
          R14.     h0 and h2 and not h3    then   y   0.992
          R15.     h1 and h2 and not h3    then   y   0.885
Rewritten final rules
          FR12.    x0 and x1               then   y  -0.966
          FR13.    not x0 and not x1       then   y  -0.974
          FR14-1.  not x0 and x1           then   y   0.992
          FR14-2.  x0 and not x1           then   y   0.992

Table 1. XOR: The intermediate rules and final rules.

For prediction accuracy evaluation, the set of 150 instances is divided into
two sets of 75 instances each. Network1 is trained with a training set
(dataset1) and evaluated with a test set (dataset2). Ruleset1 is generated
from network1 and evaluated with the same test set. Network2 and ruleset2
are generated with dataset2 and evaluated with dataset1. The prediction
accuracy of the networks and the extracted rulesets is given in Table 4.
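
The ruleset-level figures reported for the Iris data are consistent with a
simple reading of the three measures, sketched below. This reading is an
assumption inferred from the reported percentages, not a definition given in
the text: coverage is the share of instances matched by at least one rule,
confidence is the share of matched instances classified correctly, and
classification is the share of all instances classified correctly. The counts
148 and 146 are likewise inferred from the percentages, and `ruleset_metrics`
is a hypothetical helper.

```python
def ruleset_metrics(total, covered, correct):
    """Return (coverage, confidence, classification) as percentages."""
    coverage = 100.0 * covered / total        # matched by some rule
    confidence = 100.0 * correct / covered    # correct among matched
    classification = 100.0 * correct / total  # correct among all
    return coverage, confidence, classification

# Counts inferred from the reported 98.67 / 98.65 / 97.33 on 150 instances.
cov, conf, cls = ruleset_metrics(total=150, covered=148, correct=146)
```

With these definitions, classification is the product of coverage and
confidence, which also matches the other ruleset rows reported in this
section.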

R1. (petal-length < 2.7) and (3.1 < sepal-width)               then setosa 0.952
R2. (petal-length < 2.7) and (petal-width < 0.7)               then setosa 0.845
R3. (petal-width < 0.7) and (3.1 < sepal-width)
    and (sepal-length < 5.4)                                   then setosa 0.845
R4. (2.7 < petal-length < 5.0) and (3.1 < sepal-width)
    and (5.4 < sepal-length < 6.3)                             then versicolor 0.992
R5. (2.7 < petal-length < 5.0) and (0.7 < petal-width < 1.6)   then versicolor 0.992
R6. (2.7 < petal-length < 5.0) and (6.3 < sepal-length)
    and not (sepal-width < 2.8)                                then versicolor 0.992
R7. (5.0 < petal-length) and (1.6 < petal-width)               then virginica 0.969
R8. (5.0 < petal-length) and (sepal-width < 2.8)               then virginica 0.969
R9. (1.6 < petal-width) and (sepal-width < 2.8)                then virginica 0.969

Table 2. Iris: Nine individual rules.

Rule   Size   Coverage(%)   Accuracy(%)
R1     2      30.0          100
R2     2      50.0          100
R3     3      28.0          100
R4     3      2.0           100
R5     2      45.5          98.9
R6     3      10.0          100
R7     2      39.5          98.7
R8     2      12.5          96
R9     2      16.5          96.97

Ruleset Size   Coverage(%)   Confidence(%)   Classification(%)
9              98.67         98.65           97.33

Table 3. Iris: Performances of individual rules and Comprehension accuracy of
a ruleset.

            Prediction(%)
network1    98.7
network2    98.7

            Size   Coverage(%)   Confidence(%)   Prediction(%)
ruleset1    11     97.3          98.6            96.0
ruleset2    7      100.0         98.7            98.7

Table 4. Iris: Prediction accuracy of networks and rulesets.

4.3 Hypothyroid Disease Domain

The hypothyroid disease data set involves 2 concepts (hypothyroid and
negative) and 26 variables: 7 continuous-valued and 19 binary attributes. The
data set contains 3163 instances: 151 hypothyroid and 3012 negative instances.
Some instances contain several missing values.

Since the data set involves continuous-valued variables, missing values, and
an imbalance between the instances of the two concepts, it is preprocessed
before the experiments are performed. First, the instances with missing values
in attributes TSH, TT4, and FTI are filtered out, leaving 2694 instances (150
hypothyroid, 2544 negative). To collect the same number of instances for each
concept, 150 negative instances are selected randomly from the 2544 negatives.
Thus, a data set of 300 instances (150 hypothyroid and 150 negative) is
obtained for the experiments. The data set is divided into two sets, each
containing 150 instances (75 hypothyroid, 75 negative). One set is used for
training and the other for testing. Of the 26 variables, 24 are used in this
experiment, because variable TBG has too many missing values (91.8% missing).
Thus the two variables related to TBG (TBG and TBG measured) are excluded from
the experimental data set. The six remaining continuous-valued attributes are
discretized into several binary attributes according to their interval
distributions, resulting in a total of 52 binary attributes.
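
The filtering, balancing, and splitting steps described above can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
authors' code: records are plain dictionaries, a missing value is represented
by `None`, the helper name `balance_and_split` and the toy data are
hypothetical, and the discretization step is omitted.

```python
import random

REQUIRED = ("TSH", "TT4", "FTI")  # attributes that must not be missing

def balance_and_split(records, seed=0):
    """Drop incomplete records, balance the classes, and split in two halves."""
    rng = random.Random(seed)
    complete = [r for r in records if all(r.get(a) is not None for a in REQUIRED)]
    pos = [r for r in complete if r["label"] == "hypothyroid"]
    neg = [r for r in complete if r["label"] == "negative"]
    # Assumes there are at least as many negatives as positives.
    neg = rng.sample(neg, len(pos))
    rng.shuffle(pos)
    half = len(pos) // 2
    set1 = pos[:half] + neg[:half]
    set2 = pos[half:] + neg[half:]
    return set1, set2

# Toy data: 6 complete positives, 2 positives missing TSH, 20 complete negatives.
toy = (
    [{"TSH": 9.0, "TT4": 50.0, "FTI": 60.0, "label": "hypothyroid"} for _ in range(6)]
    + [{"TSH": None, "TT4": 50.0, "FTI": 60.0, "label": "hypothyroid"} for _ in range(2)]
    + [{"TSH": 1.0, "TT4": 100.0, "FTI": 110.0, "label": "negative"} for _ in range(20)]
)
set1, set2 = balance_and_split(toy)
```

On the real data the same steps would yield two sets of 150 instances each;
fixing the seed keeps the random selection of negatives reproducible.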

A 3-layered neural network (52-5-1) is configured and trained with a training
set. Rules are extracted from the trained network using the OAS algorithm and
evaluated over the training and testing sets. The performance of the neural
networks and the extracted rulesets is listed in Table 5. Network 1 is trained
on set1, and network 2 is trained on set2. Ruleset1 is the set of rules
extracted from network 1, and ruleset2 is the one extracted from network 2.
The individual rules in ruleset2 are listed in Table 6, and their performance
(coverage and accuracy) is given in Table 7.



                     Classification(%)
                     set1    set2
Network 1 (52-5-1)   98.7    96.7
Network 2 (52-5-1)   97.3    98.7

                     Coverage(%)      Confidence(%)    Classification(%)
                     set1    set2     set1    set2     set1    set2
Ruleset1 (7 rules)   75.33   64.7     100     98.97    75.33   64.03
Ruleset2 (8 rules)   88      87.3     100     99.24    88      86.64



Table 5. Hypothyroid: Performance of two rulesets extracted from two neural networks 



5 Conclusions 

We addressed computational complexity issues in extracting if-then rules from
trained feed-forward neural networks and summarized related heuristics. In this
paper, we introduced a computationally efficient technique based on ordered
attributes. This technique extracts valid and general rules from single nodes in
0(n log n) most cases and less than 0(2"“^) worst case. It is applied to well- 
known public domain data sets and the experimental results are provided. The 
extracted rules are evaluated with coverage, accuracy and confidence and shown 
to be efficient and comprehensive. 
R1. (FTI < 70) and (TSH > 7.3) and not (60 < age < 80)    then hypothyroid 0.906
R2. (FTI < 70) and (TSH > 7.3) and (Female)               then hypothyroid 0.906
R3. (FTI < 70) and (TSH > 7.3) and (TT4 < 68)             then hypothyroid 0.906
R4. (FTI < 70) and (TSH > 7.3) and (0 < T3 < 1)           then hypothyroid 0.886
R5. (FTI > 70) and (TSH < 7.3) and (Male)
    and (on-thyroxine=f)                                  then negative 0.946
R6. (FTI > 70) and (TSH < 7.3) and (TT4 > 68)
    and (on-thyroxine=f)                                  then negative 0.946
R7. (FTI > 70) and (TSH < 7.3) and (1 < T3 < 2)           then negative 0.946
R8. (FTI > 70) and (TSH < 7.3) and (thyroid-surgery=f)
    and (on-thyroxine=f) and not (40 < age < 60)          then negative 0.946



Table 6. Hypothyroid: Eight individual rules in ruleset2. 



Rule   Size   Coverage(%)   Accuracy(%)
R1     3      29.3          100
R2     3      34.7          100
R3     3      46.7          100
R4     3      26.7          100
R5     4      15.3          100
R6     4      39.3          100
R7     3      19.3          100
R8     5      30.7          100

Table 7. Hypothyroid: Performances of individual rules in ruleset2. Evaluated
on test set (set1).



In this work, we focused on the domain complexity of discrete input attributes.
However, complexity in continuous-valued domains is much harder to handle, and
how to handle it is an important research issue. Another issue that needs to be
investigated is the complexity of aggregating intermediate rules in the
decompositional approaches, which grows with the number of intermediate rules
extracted from single nodes.
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Abstract. We consider inductive language learning from positive ex- 
amples, some of which may be incorrect. In the present paper, the error 
or incorrectness we consider is the one described uniformly in terms of 
a distance over strings. Firstly, we introduce a notion of a recursively 
generable distance over strings, and define a k-neighbor closure of a
language L as the collection of strings each of which is at most k distant
from some string in L. Then we define a k-neighbor system as the collection
of original languages and their j-neighbor closures with j ≤ k, and
adopt it as a hypothesis space. In the ordinary learning paradigm, a target
language, whose examples are fed to an inference machine, is assumed to 
belong to a hypothesis space without any guarantee. In this paper, we 
allow an inference machine to infer a neighbor closure instead of the orig- 
inal language as an admissible approximation. We formalize such kind of 
inference, and give some sufficient conditions for a hypothesis space. 



1 Introduction 

In many real-world applications of machine discovery or machine learning from 
examples, we have to deal with incorrect examples. In the present paper, we 
consider language learning from observed incorrect examples together with cor- 
rect examples, i.e., from imperfect examples. Some correct examples may not be 
presented to the learner. It is natural to consider that each observed incorrect ex- 
ample has some connection with a certain correct example on a target language 
to be learned. The incorrect examples we consider here are the ones described 
uniformly in terms of a distance over strings. Assume that the correct example 
is a string v and the observed example is a string w. If we consider the
so-called Hamming distance and the two strings v and w have the same length
but differ in just one symbol, then we estimate the incorrectness as a distance
of one. If we consider the edit distance and w can be obtained from
v by deleting just one symbol and inserting one symbol in another place, then we
estimate the incorrectness as a distance of two. Firstly, we introduce a notion
of a recursively generable distance over strings, and define a k-neighbor closure
of a language L as the collection of strings each of which is at most k distant

* Supported in part by Grant-in-Aid for Scientific Research on Priority Areas No. 

10143104 from the Ministry of Education, Science and Culture, Japan. 



S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 183-195, 2000. 
© Springer-Verlag Berlin Heidelberg 2000 







Yasuhito Mukouchi and Masako Sato 



from some string in L. Then we define a k-neighbor system as the collection of
original languages and their j-neighbor closures with j ≤ k, and adopt it as a
hypothesis space.

There are various approaches to language learning from incorrect examples 
(cf. e.g. Jain [5], Stephan [15], and Case and Jain [3]). Stephan [15] has formu- 
lated a model of noisy data, in which a correct example crops up infinitely often, 
and an incorrect example only finitely often. There is no connection between 
incorrect examples considered there and correct examples. 

In 1967, Gold [4] introduced a notion of identification in the limit. An infer- 
ence machine M is said to identify a language L in the limit, if the sequence of 
guesses from M , which is successively fed a sequence of examples of L, converges 
to a correct expression of L, that is, all guesses from M become a unique expres- 
sion after a certain finite time and that the expression is a correct one. In this 
criterion, a target language, whose examples are fed to an inference machine, is 
assumed to belong to a hypothesis space which is given in advance. However, 
this assumption is not appropriate, if we want an inference machine to infer or 
to discover an unknown rule which explains examples or data obtained from 
scientific experiments. In their previous paper, Mukouchi and Arikawa [10] dis- 
cussed both refutability and inferability of the hypothesis space concerned from 
examples. If a target language is a member of the hypothesis space, then an 
inference machine should identify the target language in the limit, otherwise it 
should refute the hypothesis space itself in a finite time. They showed that there 
are some rich hypothesis spaces that are refutable and inferable from complete 
examples (i.e., positive and negative examples or an informant), but refutable 
and inferable classes from only positive examples are very small. In relation to 
refutable inference, Lange and Watson [9] and Mukouchi [12] also proposed in- 
ference criteria relaxing the requirements of inference machines, and Jain [6] also 
deals with the problem for recursively enumerable languages. On the other hand, 
Mukouchi [11] took a minimal language as an admissible approximate language 
within the hypothesis space, and forced an inference machine to converge to an 
expression of a minimal language of the target language which may not belong to 
the hypothesis space. Kobayashi and Yokomori [7] also proposed an inference
criterion requiring an inference machine to infer an admissible approximate
language within the hypothesis space concerned.

As mentioned above, the obtained examples may have errors, and thus the 
observed language consisting of the observed examples may not belong to the 
hypothesis space, even when the original target language belongs to the hypoth- 
esis space. Therefore we have to take account of languages not belonging to the 
hypothesis space concerned. In this paper, we also take a minimal language as an 
admissible approximate language within the hypothesis space for the observed 
language. By doing this, we guarantee that if the observed examples have no er- 
rors and the target language is in the hypothesis space, then an inference machine 
converges to a correct expression of the target language. Furthermore, by taking 
a k-neighbor system as a hypothesis space, we can expect an inference machine
to infer an original target language, even when the observed examples have some 
errors. Roughly speaking, an inference machine M k-neighbor-minimally infers
a class $\mathcal{L}$ from positive examples, if for every observed language L,
M converges to an expression of L's minimal language within a k-neighbor system
which is least distant from some language in $\mathcal{L}$.

We formalize k-neighbor-minimal inferability, and give some sufficient
conditions for a hypothesis space. Finally, as its application, we show that
the class of pattern languages is k-neighbor-minimally inferable from positive
examples.



2 Preliminaries 



2.1 A Language and a Distance 

Let $\Sigma$ be a fixed finite alphabet. Each element of $\Sigma$ is called a
constant symbol. Let $\Sigma^+$ be the set of all nonnull constant strings over
$\Sigma$ and let $\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$, where
$\varepsilon$ is the null string. A subset $L$ of $\Sigma^*$ is called a
language. For a string $w \in \Sigma^*$, the length of $w$ is denoted by $|w|$.

A language $L \subseteq \Sigma^*$ is said to be recursive, if there is a
computable function $f : \Sigma^* \to \{0, 1\}$ such that $f(w) = 1$ iff
$w \in L$ for $w \in \Sigma^*$.

We consider a distance between two strings defined as follows: 

Definition 1. Let $N = \{0, 1, 2, \ldots\}$ be the set of all natural numbers.
A function $d : \Sigma^* \times \Sigma^* \to N \cup \{\infty\}$ is called a
distance over strings, if it satisfies the following three conditions:

(i) For any $v, w \in \Sigma^*$, $d(v,w) = 0$ iff $v = w$.

(ii) For any $v, w \in \Sigma^*$, $d(v,w) = d(w,v)$.

(iii) For any $u, v, w \in \Sigma^*$, $d(u,v) + d(v,w) \ge d(u,w)$.

A distance $d$ is said to be recursive, if there is an effective procedure that
computes $d(v,w)$ for any $v, w \in \Sigma^*$ with $d(v,w) \ne \infty$.

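Then we define a k-neighbor closure of a language as follows:

Definition 2. Let $d : \Sigma^* \times \Sigma^* \to N \cup \{\infty\}$ be a
distance over strings and let $k \in N$.

The k-neighbor closure $w^{(d,k)}$ of a string $w \in \Sigma^*$ w.r.t. $d$ is
the set of all strings each of which is at most $k$ distant from $w$, that is,
we put $w^{(d,k)} = \{ v \in \Sigma^* \mid d(v,w) \le k \}$.

The k-neighbor closure $L^{(d,k)}$ of a language $L \subseteq \Sigma^*$ w.r.t.
$d$ is the set of all strings each of which is at most $k$ distant from some
string in $L$, that is, we put
$L^{(d,k)} = \bigcup_{w \in L} w^{(d,k)} = \{ v \in \Sigma^* \mid \exists w \in L \text{ s.t. } d(v,w) \le k \}$.

By Definition 1, we see that
$\{w\} = w^{(d,0)} \subseteq w^{(d,1)} \subseteq w^{(d,2)} \subseteq \cdots$ and
$L = L^{(d,0)} \subseteq L^{(d,1)} \subseteq L^{(d,2)} \subseteq \cdots$.

The following lemma is obvious:

Lemma 1. Let $d$ be a distance and let $k \in N$.

For a language $L \subseteq \Sigma^*$ and for a string $w \in \Sigma^*$,
$w \in L^{(d,k)}$ if and only if $w^{(d,k)} \cap L \ne \emptyset$.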
For a set $S$, we denote by $\sharp S$ the cardinality of $S$.

Definition 3. A distance $d$ is said to have finite thickness, if for any
$w \in \Sigma^*$, $\sharp w^{(d,k)}$ is finite.

A distance $d$ is said to be recursively generable, if $d$ has finite thickness
and there exists an effective procedure that on inputs $k \in N$ and
$w \in \Sigma^*$ enumerates all elements in $w^{(d,k)}$ and then stops.

We note that a notion of the recursively generable finite-set-valued function 
was introduced by Lange and Zeugmann [8]. 

Example 1. (a) We consider a distance known as the Hamming distance. For a
string $w$ and for an integer $i$ with $1 \le i \le |w|$, by $w[i]$, let us
denote the $i$-th symbol appearing in $w$. For two strings
$v, w \in \Sigma^*$, let

\[ d(v,w) = \begin{cases}
  \sharp\{ i \mid 1 \le i \le |v|,\ v[i] \ne w[i] \}, & \text{if } |v| = |w|, \\
  \infty, & \text{if } |v| \ne |w|.
\end{cases} \]

Clearly, this distance $d$ is recursively generable.

(b) Next, we consider a distance known as the edit distance. Roughly speaking,
the edit distance $d$ over two strings $v, w \in \Sigma^*$ is the least number
of editing steps needed to convert $v$ to $w$. Each editing step consists of a
rewriting step of the form $a \to \varepsilon$ (a deletion),
$\varepsilon \to b$ (an insertion), or $a \to b$ (a change), where
$a, b \in \Sigma$.

Clearly, this distance $d$ is recursively generable.
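
Both distances in Example 1 are easy to compute. The following sketch is an
illustration only: the function names `hamming` and `edit_distance` are
assumptions, and the edit distance is computed with the standard
dynamic-programming recurrence over deletions, insertions, and changes, each
costing one step.

```python
import math

def hamming(v, w):
    """Number of differing positions, or infinity if the lengths differ."""
    if len(v) != len(w):
        return math.inf
    return sum(1 for a, b in zip(v, w) if a != b)

def edit_distance(v, w):
    """Least number of deletions, insertions, and changes turning v into w."""
    prev = list(range(len(w) + 1))          # distances from the empty prefix of v
    for i, a in enumerate(v, start=1):
        cur = [i]                           # deleting the first i symbols of v
        for j, b in enumerate(w, start=1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # change a to b (free if equal)
        prev = cur
    return prev[-1]
```

For instance, `edit_distance("abc", "bca")` is 2: one deletion and one
insertion in another place, the situation described in the introduction.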

Let $d$ be a recursively generable distance and let $k \in N$. Then, for any
$v, w \in \Sigma^*$, by checking $v \in w^{(d,k)}$, whether $d(v,w) \le k$ or
not is recursively decidable. Therefore $d$ turns out to be a recursive
distance. Let $L \subseteq \Sigma^*$ be a recursive language. Then, for any
$w \in \Sigma^*$, by checking $w^{(d,k)} \cap L \ne \emptyset$, whether
$w \in L^{(d,k)}$ or not is recursively decidable. Therefore $L^{(d,k)}$ is
also a recursive language.
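
To make the decidability argument concrete, the sketch below enumerates the
k-neighbor closure of a string under the Hamming distance and then decides
membership in the closure of a language by testing whether that finite set
meets the language, as in Lemma 1. The names `hamming_closure` and
`in_closure`, and the representation of a recursive language by a Python
membership predicate `member`, are assumptions made for this illustration.

```python
from itertools import combinations, product

def hamming_closure(w, k, alphabet):
    """All strings of length |w| that differ from w in at most k positions."""
    result = set()
    for r in range(min(k, len(w)) + 1):
        for positions in combinations(range(len(w)), r):
            for symbols in product(alphabet, repeat=r):
                chars = list(w)
                for p, s in zip(positions, symbols):
                    chars[p] = s            # may rewrite a symbol to itself
                result.add("".join(chars))
    return result

def in_closure(w, k, alphabet, member):
    """Decide whether w lies in the k-neighbor closure of the language `member`."""
    return any(member(u) for u in hamming_closure(w, k, alphabet))

# Example language: strings consisting only of the symbol "a".
only_a = lambda u: set(u) <= {"a"}
```

The enumeration always terminates because the closure of a single string is
finite, which is exactly what recursive generability asks for.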

In the present paper, we exclusively deal with recursively generable
distances, and simply refer to such a distance as a distance without further
notice.

2.2 Inferability from Examples

Definition 4 (Angluin [2]). A class $\mathcal{L} = \{L_i\}_{i \in N}$ of
languages is said to be an indexed family of recursive languages, if there is
a computable function $f : N \times \Sigma^* \to \{0, 1\}$ such that
$f(i,w) = 1$ iff $w \in L_i$.

In what follows, we assume that a class of languages is an indexed family of
recursive languages, and identify a class with a hypothesis space.

A positive presentation, or a text, of a nonempty language
$L \subseteq \Sigma^*$ is an infinite sequence $w_1, w_2, \ldots \in \Sigma^*$
such that $\{w_1, w_2, \ldots\} = L$. In what follows, $\sigma$ or $\delta$
denotes a positive presentation, $\sigma[n]$ denotes $\sigma$'s initial segment
of length $n \in N$, and $\sigma[n]^+$ denotes the set of all elements
appearing in $\sigma[n]$.

An inductive inference machine {IIM, for short) is an effective procedure, 
or a certain type of Turing machine, which requests inputs from time to time 
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and produces positive integers from time to time. The outputs produced by the 
machine are called guesses. For an IIM $M$ and for a finite sequence
$\sigma[n] = w_1, w_2, \ldots, w_n$, by $M(\sigma[n])$, we denote the last
guess of $M$ which is successively presented $w_1, w_2, \ldots, w_n$ on its
input requests.

Then we define the inferability of a class of languages as follows: 

Definition 5 (Gold [4], Angluin [2]). An IIM $M$ is said to converge to an
index $i$ for a positive presentation $\sigma$, if there is an $n \in N$ such
that for any $m \ge n$, $M(\sigma[m]) = i$.

An IIM $M$ is said to infer a class $\mathcal{L}$ in the limit from positive
examples, if for any $L_i \in \mathcal{L}$ and for any positive presentation
$\sigma$ of $L_i$, $M$ converges to an index $j$ for $\sigma$ such that
$L_j = L_i$.

A class C is said to be inferable in the limit from positive examples, if there 
is an IIM which infers C in the limit from positive examples. 

In the above definition, the behavior of an IIM is not specified, when we feed
a positive presentation of a language which is not in the class concerned. On
the other hand, Mukouchi [11] proposed an inference criterion requiring an
inference machine to infer an admissible approximate language within the
hypothesis space concerned.

Let $S$ be a subset of $\Sigma^*$ and let $\mathcal{L}$ be a class. Then a
language $L \in \mathcal{L}$ is a minimal language of $S$ within $\mathcal{L}$,
if (i) $S \subseteq L$ and (ii) for any $L_i \in \mathcal{L}$,
$S \subseteq L_i$ implies $L_i \not\subsetneq L$.

The set of all minimal languages of $S$ within $\mathcal{L}$ is denoted by
$\mathrm{MIN}(S, \mathcal{L})$.
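
For a finite class of finite languages, the set of minimal languages of a
sample can be computed directly from the definition. The sketch below is
illustrative only: real indexed families contain infinite languages and
cannot be handled this way, and the helper name `minimal_languages` and the
use of Python sets for languages are assumptions.

```python
def minimal_languages(sample, languages):
    """Languages that contain `sample` and have no proper subset in the
    class that also contains `sample`."""
    sample = frozenset(sample)
    candidates = [frozenset(L) for L in languages if sample <= frozenset(L)]
    # `other < L` is Python's proper-subset test on sets.
    return {L for L in candidates if not any(other < L for other in candidates)}

chain = [{"a"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
```

With `chain` as the class, the sample `{"b"}` has the single minimal language
`{"a", "b"}`, while a sample containing a string outside every language has
none.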

Definition 6 (Mukouchi [11]). An IIM $M$ is said to minimally infer a class
$\mathcal{L}$ from positive examples, if it satisfies the following condition:
For any nonempty language $L \subseteq \Sigma^*$ and for any positive
presentation $\sigma$ of $L$, if $\mathrm{MIN}(L, \mathcal{L}) \ne \emptyset$,
then $M$ converges to an index $i$ for $\sigma$ such that
$L_i \in \mathrm{MIN}(L, \mathcal{L})$.

A class $\mathcal{L}$ is said to be minimally inferable from positive examples,
if there is an IIM which minimally infers $\mathcal{L}$ from positive examples.

Now, we introduce our successful learning criterion we consider in the present 
paper. 

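Definition 7. Let $d$ be a distance and let $k \in N$.

For a class $\mathcal{L} = \{L_i\}_{i \in N}$, let us put
$\mathcal{L}^{(d,k)} = \{L_i^{(d,k)}\}_{i \in N}$.

A k-neighbor system $\mathcal{L}^{(d,\le k)}$ of a class $\mathcal{L}$ w.r.t.
$d$ is the collection of languages each of which is a j-neighbor closure w.r.t.
$d$ of some language in $\mathcal{L}$ for some $j \le k$, that is, we put
$\mathcal{L}^{(d,\le k)} = \bigcup_{j=0}^{k} \mathcal{L}^{(d,j)}$.

For a nonempty language $L \subseteq \Sigma^*$, a pair $(i,j) \in N \times N$
is said to be a weak k-neighbor-minimal answer for $L$, if $j \le k$ and
$L_i^{(d,j)} \in \mathrm{MIN}(L, \mathcal{L}^{(d,\le k)})$.

For a nonempty language $L \subseteq \Sigma^*$, a pair $(i,j) \in N \times N$
is said to be a k-neighbor-minimal answer for $L$, if (i) $(i,j)$ is a weak
k-neighbor-minimal answer for $L$ and (ii) for any pair $(i',j')$ with
$j' < j$, $L_{i'}^{(d,j')} \notin \mathrm{MIN}(L, \mathcal{L}^{(d,\le k)})$.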
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An IIM M is said to k-neighbor-minimally (resp., weak k-neighhor-minimally) 
infer a class $\mathcal{L}$ w.r.t. $d$ from positive examples, if it satisfies
the following condition: For any nonempty language $L \subseteq \Sigma^*$ and
for any positive presentation $\sigma$ of $L$, if
$\mathrm{MIN}(L, \mathcal{L}^{(d,\le k)}) \ne \emptyset$, then $M$ converges to
an integer $\langle i,j \rangle$ for $\sigma$ such that $(i,j)$ is a
k-neighbor-minimal (resp., weak k-neighbor-minimal) answer for $L$, where
$\langle \cdot, \cdot \rangle$ represents Cantor's pairing function.

A class $\mathcal{L}$ is said to be k-neighbor-minimally (resp., weak
k-neighbor-minimally) inferable w.r.t. $d$ from positive examples, if there is
an IIM which k-neighbor-minimally (resp., weak k-neighbor-minimally) infers
$\mathcal{L}$ w.r.t. $d$ from positive examples.
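
The definition above has the machine output a single integer that encodes a
pair (i, j). The text names Cantor's pairing function but does not spell out a
formula; the sketch below uses one common convention, which is an assumption
(other presentations swap the roles of the two coordinates). The function
names are hypothetical.

```python
import math

def cantor_pair(i, j):
    """Encode a pair of natural numbers as one natural number."""
    s = i + j
    return s * (s + 1) // 2 + j

def cantor_unpair(z):
    """Inverse of cantor_pair."""
    s = (math.isqrt(8 * z + 1) - 1) // 2   # index of the diagonal containing z
    j = z - s * (s + 1) // 2
    return s - j, j
```

Because the encoding is a bijection between pairs and natural numbers, a
guess sequence converges as a sequence of integers exactly when both the index
and the neighbor radius stabilize.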

We note that, by the definition, a class $\mathcal{L}$ is (weak)
0-neighbor-minimally inferable w.r.t. $d$ from positive examples, if and only
if $\mathcal{L}$ is minimally inferable from positive examples.

Assume that a class $\mathcal{L}$ is k-neighbor-minimally inferable w.r.t. $d$
from positive examples for some $k \in N$ and for some distance $d$. For any
$L_i \in \mathcal{L}$, $L_i = L_i^{(d,0)}$ and
$\mathrm{MIN}(L_i, \mathcal{L}^{(d,\le k)}) = \{L_i\}$. Therefore, if
$(i',j')$ is a k-neighbor-minimal answer for $L_i$, then $L_i = L_{i'}$ and
$j' = 0$. Thus the class $\mathcal{L}$ is also inferable in the limit from
positive examples.

Therefore (weak) k-neighbor-minimal inferability can be regarded as a natural
extension of ordinary inferability as well as minimal inferability.

The rest of this section is devoted to summarizing some known results related
to this study.

Angluin [2] characterized an ordinary inferability as follows: 

Definition 8 (Angluin [2]). Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class.

A set $S \subseteq \Sigma^*$ is said to be a finite tell-tale set of a language
$L_i \in \mathcal{L}$ within $\mathcal{L}$, if (i) $S$ is a finite subset of
$L_i$ and (ii) for any $L_j \in \mathcal{L}$, $S \subseteq L_j$ implies
$L_j \not\subsetneq L_i$.

Theorem 1 (Angluin [2]). A class $\mathcal{L}$ is inferable in the limit from
positive examples, if and only if there is an effective procedure which on
input $i$ enumerates a finite tell-tale set $S$ of $L_i \in \mathcal{L}$ within
$\mathcal{L}$.

Some sufficient but useful conditions for ordinary inferability have been
presented.

Definition 9 (Wright [16], Motoki et al. [13]). A class $\mathcal{L}$ is said
to have infinite elasticity, if there are two infinite sequences
$w_0, w_1, w_2, \ldots \in \Sigma^*$ and
$L_{j_1}, L_{j_2}, \ldots \in \mathcal{L}$ such that for any $i \ge 1$,

\[ \{w_0, w_1, \ldots, w_{i-1}\} \subseteq L_{j_i} \quad \text{but} \quad w_i \notin L_{j_i}. \]

A class $\mathcal{L}$ is said to have finite elasticity, if $\mathcal{L}$ does
not have infinite elasticity.

Theorem 2 (Wright [16]). If a class $\mathcal{L}$ has finite elasticity, then
there is an effective procedure which on input $i$ enumerates a finite
tell-tale set $S$ of $L_i$ within $\mathcal{L}$, and thus $\mathcal{L}$ is
inferable in the limit from positive examples.
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Definition 10 (Angluin [2]). A class C is said to have finite thickness, if for 
any nonempty finite set $S \subseteq \Sigma^*$,
$\sharp\{L_i \in \mathcal{L} \mid S \subseteq L_i\}$ is finite.

As easily seen, a class $\mathcal{L}$ has finite thickness, if and only if for
any $w \in \Sigma^*$, $\sharp\{L_i \in \mathcal{L} \mid w \in L_i\}$ is finite.

Theorem 3 (Wright [16], Angluin [2]). If a class $\mathcal{L}$ has finite
thickness, then $\mathcal{L}$ has finite elasticity, and thus $\mathcal{L}$ is
inferable in the limit from positive examples.

Definition 11 (Sato [14]). A class $\mathcal{L}$ is said to satisfy
MEF-condition, if for any nonempty finite set $S \subseteq \Sigma^*$ and for
any $L_i \in \mathcal{L}$ with $S \subseteq L_i$, there is an
$L_j \in \mathrm{MIN}(S, \mathcal{L})$ such that $L_j \subseteq L_i$.

A class $\mathcal{L}$ is said to satisfy MFF-condition, if for any nonempty
finite set $S \subseteq \Sigma^*$, $\sharp\mathrm{MIN}(S, \mathcal{L})$ is
finite.

A class $\mathcal{L}$ is said to have M-finite thickness, if $\mathcal{L}$
satisfies both MEF-condition and MFF-condition.

We note that, as easily seen, if a class has finite thickness, then the class
has M-finite thickness.

Theorem 4 (Sato [14]). If a class $\mathcal{L}$ has M-finite thickness and
every language in $\mathcal{L}$ has its finite tell-tale set within
$\mathcal{L}$, then $\mathcal{L}$ is inferable in the limit from positive
examples.

Mukouchi [11] showed a sufficient condition for minimal inferability.

Theorem 5 (Mukouchi [11]). If a class $\mathcal{L}$ has M-finite thickness and
finite elasticity, then $\mathcal{L}$ is minimally inferable from positive
examples.

3 Neighbor-Minimal Inferability 

3.1 Some Properties 

Theorem 6. Let $d$ be a distance and let $k \in N$.

If a class $\mathcal{L}$ has finite thickness (resp., finite elasticity or
M-finite thickness), then the class $\mathcal{L}^{(d,k)}$ also has finite
thickness (resp., finite elasticity or M-finite thickness).

Proof. (I) Let $\mathcal{L}$ be a class with finite thickness. Then let
$w \in \Sigma^*$, and let us put $S = w^{(d,k)}$ and
$\mathcal{F} = \{L \in \mathcal{L} \mid S \cap L \ne \emptyset\}$. Since we
have assumed that $d$ has finite thickness, we see that both $\sharp S$ and
$\sharp\mathcal{F}$ are finite.

Let $L' \in \mathcal{L}^{(d,k)}$ be a language with $w \in L'$. Then there is
an $L \in \mathcal{L}$ such that $L' = L^{(d,k)}$. Since $w \in L'$, it follows
that there is a $u \in L$ such that $d(w,u) \le k$, i.e., $u \in S$. Therefore
$S \cap L \ne \emptyset$, and thus $L \in \mathcal{F}$.

This means that
$\{L' \in \mathcal{L}^{(d,k)} \mid w \in L'\} \subseteq \{L^{(d,k)} \mid L \in \mathcal{F}\}$,
and thus $\sharp\{L' \in \mathcal{L}^{(d,k)} \mid w \in L'\}$ is not greater
than $\sharp\mathcal{F}$, which is finite.
Therefore $\mathcal{L}^{(d,k)}$ has finite thickness.

(II) The proof can be done by contraposition. Suppose that the class
$\mathcal{L}^{(d,k)}$ has infinite elasticity, that is, suppose that there are
two infinite sequences $w_0, w_1, \ldots \in \Sigma^*$ and
$L'_{j_1}, L'_{j_2}, \ldots \in \mathcal{L}^{(d,k)}$ such that
$\{w_0, w_1, \ldots, w_{i-1}\} \subseteq L'_{j_i}$ but $w_i \notin L'_{j_i}$
for any $i \ge 1$.

For each $i \ge 1$, let $L_{j_i} \in \mathcal{L}$ be a language such that
$L'_{j_i} = L_{j_i}^{(d,k)}$.

Put $S_0 = w_0^{(d,k)}$, $I_0 = N$, and $n_0 = 0$. For each
$i \in I_0 \setminus \{n_0\}$, since $w_0 \in L'_{j_i}$, it follows by Lemma 1
that $S_0 \cap L_{j_i} \ne \emptyset$. Since $S_0$ is a finite set, we see that
there is at least one $u_0 \in S_0$ such that $u_0$ belongs to infinitely many
$L_{j_i}$'s with $i \in I_0 \setminus \{n_0\}$.

We define $I_m$'s and $n_m$'s ($m \ge 1$) recursively as follows: Put
$I_m = \{ i \in I_{m-1} \setminus \{n_{m-1}\} \mid \{u_0, u_1, \ldots, u_{m-1}\} \subseteq L_{j_i} \}$
and $n_m = \min I_m$. We note that, by the construction, $\sharp I_m$ is
infinite. Put $S_m = w_{n_m}^{(d,k)}$. For each $i \in I_m \setminus \{n_m\}$,
since $w_{n_m} \in L'_{j_i}$, it follows by Lemma 1 that
$S_m \cap L_{j_i} \ne \emptyset$. Since $S_m$ is a finite set, we see that
there is at least one $u_m \in S_m$ such that $u_m$ belongs to infinitely many
$L_{j_i}$'s with $i \in I_m \setminus \{n_m\}$.

By the construction, for each $i \ge 1$,
$\{u_0, u_1, \ldots, u_{i-1}\} \subseteq L_{j_{n_i}}$. Furthermore, for each
$i \ge 1$, since $w_{n_i} \notin L'_{j_{n_i}}$, we see by Lemma 1 that
$S_i \cap L_{j_{n_i}} = \emptyset$, and thus $u_i \notin L_{j_{n_i}}$.
Therefore $\mathcal{L}$ has infinite elasticity.

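(III) Let $\mathcal{L}$ be a class with M-finite thickness. Let
$S \subseteq \Sigma^*$ be a nonempty finite set. Then let us put
$S_0 = S^{(d,k)}$ and
$\mathcal{S} = \{ S' \subseteq S_0 \mid S \subseteq (S')^{(d,k)} \}$. It is
easy to see that for any $L \in \mathcal{L}$, if there is an
$S' \in \mathcal{S}$ such that $S' \subseteq L$, then $S \subseteq L^{(d,k)}$.

Claim A: Let $L' \in \mathcal{L}^{(d,k)}$ with $S \subseteq L'$. Then for any
$L \in \mathcal{L}$ with $L' = L^{(d,k)}$, there exists an
$S' \in \mathcal{S}$ such that $S' \subseteq L$.

Proof of the Claim A. Let $L \in \mathcal{L}$ be a language such that
$L' = L^{(d,k)}$ and put $S' = L \cap S_0$. Then $S' \subseteq L$ holds.

We show that $S' \in \mathcal{S}$. Clearly, $S' \subseteq S_0$ holds. Let
$w \in S$. Since $S \subseteq L' = L^{(d,k)}$, it follows that
$w \in L^{(d,k)}$, and thus there is a $u \in L$ such that $d(u,w) \le k$.
By $w \in S$, we see that $u \in S_0$. Therefore $u \in L \cap S_0$, i.e.,
$u \in S'$, and thus $w \in (S')^{(d,k)}$. This means that
$S \subseteq (S')^{(d,k)}$. Therefore $S' \in \mathcal{S}$. ■

Let us put
$\mathcal{M} = \bigcup_{S' \in \mathcal{S}} \mathrm{MIN}(S', \mathcal{L})$.
Since $\mathcal{L}$ satisfies MFF-condition, we see that $\sharp\mathcal{M}$
is finite.

Claim B: For any $L' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$, there exists an
$L \in \mathcal{M}$ such that $L' = L^{(d,k)}$.

Proof of the Claim B. Let $L' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$ and let
$L_i \in \mathcal{L}$ with $L' = L_i^{(d,k)}$. Then, by Claim A, there is an
$S' \in \mathcal{S}$ such that $S' \subseteq L_i$.

Since $S' \subseteq L_i$ and $\mathcal{L}$ satisfies MEF-condition, we see that
there is an $L \in \mathrm{MIN}(S', \mathcal{L})$ such that $L \subseteq L_i$.
By $S' \in \mathcal{S}$, it follows that $L \in \mathcal{M}$ and
$S \subseteq L^{(d,k)} \subseteq L_i^{(d,k)} = L'$. Since
$L' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$, it turns out that
$L^{(d,k)} = L'$. That is, $L' = L^{(d,k)}$ for $L \in \mathcal{M}$. ■

This claim means that
$\mathrm{MIN}(S, \mathcal{L}^{(d,k)}) \subseteq \{L^{(d,k)} \mid L \in \mathcal{M}\}$,
and thus $\sharp\mathrm{MIN}(S, \mathcal{L}^{(d,k)})$ is not greater than
$\sharp\mathcal{M}$, which is finite. Thus $\mathcal{L}^{(d,k)}$ satisfies
MFF-condition.

Claim C: The class $\mathcal{L}^{(d,k)}$ satisfies MEF-condition, that is, for
any $L' \in \mathcal{L}^{(d,k)}$, if $S \subseteq L'$, then there exists an
$L'' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$ such that $L'' \subseteq L'$.

Proof of the Claim C. Let $L' \in \mathcal{L}^{(d,k)}$ with $S \subseteq L'$.
Let us consider the class
$\mathcal{N} = \{ L^{(d,k)} \subseteq L' \mid L \in \mathcal{M} \}$. As noted
above, $\sharp\mathcal{M}$ is finite, and so is $\sharp\mathcal{N}$.

Firstly, we show that $\mathcal{N}$ is nonempty. Let $L_1 \in \mathcal{L}$ be a
language such that $L' = L_1^{(d,k)}$. Since $S \subseteq L' = L_1^{(d,k)}$, it
follows by Claim A that there is an $S_1 \in \mathcal{S}$ such that
$S_1 \subseteq L_1$. Then there is an $L_2 \in \mathrm{MIN}(S_1, \mathcal{L})$
such that $S_1 \subseteq L_2 \subseteq L_1$, because $\mathcal{L}$ satisfies
MEF-condition. Thus $L_2 \in \mathcal{M}$ and
$S \subseteq L_2^{(d,k)} \subseteq L_1^{(d,k)} = L'$ holds. Hence
$L_2^{(d,k)} \in \mathcal{N}$, and thus $\mathcal{N}$ is nonempty.

Since $\mathcal{N}$ is a nonempty finite class of languages, there is a minimal
language $L'' \in \mathcal{N}$ within $\mathcal{N}$. We show that
$L'' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$. Since $L'' \in \mathcal{N}$, we
see that $S \subseteq L''$. Suppose that there is an
$L''' \in \mathcal{L}^{(d,k)}$ such that $S \subseteq L''' \subsetneq L''$. Let
$L_3 \in \mathcal{L}$ be a language such that $L''' = L_3^{(d,k)}$. Since
$S \subseteq L'''$, we see by Claim A that there is an $S' \in \mathcal{S}$
such that $S' \subseteq L_3$. Then there is an
$L_4 \in \mathrm{MIN}(S', \mathcal{L})$ such that
$S' \subseteq L_4 \subseteq L_3$, because $\mathcal{L}$ satisfies
MEF-condition. Thus $L_4 \in \mathcal{M}$ and
$S \subseteq L_4^{(d,k)} \subseteq L_3^{(d,k)} = L''' \subsetneq L'' \subseteq L'$
holds. This means $L_4^{(d,k)} \in \mathcal{N}$, which contradicts that $L''$
is a minimal language within $\mathcal{N}$. Therefore there is no
$L''' \in \mathcal{L}^{(d,k)}$ such that $S \subseteq L''' \subsetneq L''$, and
thus $L'' \in \mathrm{MIN}(S, \mathcal{L}^{(d,k)})$, which concludes the
proof. ■

Therefore the class $\mathcal{L}^{(d,k)}$ has M-finite thickness. □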



Lemma 2. If two classes $\mathcal{L}$ and $\mathcal{L}'$ have finite thickness
(resp., finite elasticity or M-finite thickness), then the class
$\mathcal{L} \cup \mathcal{L}'$ also has finite thickness (resp., finite
elasticity or M-finite thickness).

By Theorem 6 and Lemma 2, we have the following theorem:

Theorem 7. Let $d$ be a distance and let $k \in N$.

If a class $\mathcal{L}$ has finite thickness (resp., finite elasticity or
M-finite thickness), then the class $\mathcal{L}^{(d,\le k)}$ also has finite
thickness (resp., finite elasticity or M-finite thickness).

For a class $\mathcal{L} = \{L_i\}_{i \in N}$ and for $n \ge 1$, let us put

\[ \mathcal{L}^{[\le n]} = \{ L_{i_1} \cup \cdots \cup L_{i_n} \mid i_1, \ldots, i_n \in N \text{ and } L_{i_1}, \ldots, L_{i_n} \in \mathcal{L} \}. \]

By assuming a computable bijective coding from $N^n$ to $N$, the new class
above becomes an indexed family of recursive languages.







Theorem 8 (Wright [16]). Let $n \in N$.

If a class $\mathcal{L}$ has finite elasticity, then $\mathcal{L}^{[\le n]}$
also has finite elasticity.

Theorem 9 (Sato [14]). Let $n \in N$.

If a class $\mathcal{L}$ has M-finite thickness, then $\mathcal{L}^{[\le n]}$
also has M-finite thickness.

By Theorems 8, 9 and 6, and Lemma 2, we have the following theorem:

Theorem 10. Let $d$ be a distance, let $k \in N$ and let $n \in N$.

If a class $\mathcal{L}$ has finite elasticity (resp., M-finite thickness),
then $(\mathcal{L}^{[\le n]})^{(d,\le k)}$ also has finite elasticity (resp.,
M-finite thickness).

3.2 k-Neighbor-Minimal Inferability

The following theorem is a direct consequence of Theorems 5, 7 and 10:

Theorem 11. Let $d$ be a distance, let $k \in N$ and let $n \in N$.

If a class $\mathcal{L}$ has M-finite thickness and finite elasticity, then
$\mathcal{L}$ and $\mathcal{L}^{[\le n]}$ are weak k-neighbor-minimally
inferable w.r.t. $d$ from positive examples.

For a language $L \subseteq \Sigma^*$ and for $n \in N$, we put
$L^{\le n} = \{ w \in L \mid |w| \le n \}$.

We note that for two recursive languages $L, L' \subseteq \Sigma^*$ and for
$n \in N$, whether $L^{\le n} \subseteq (L')^{\le n}$ or not is recursively
decidable.

Lemma 3. Let $L, L' \subseteq \Sigma^*$ be two languages.

(a) If $L \subsetneq L'$, then there is an integer $m \in N$ such that for any
$n \ge m$, $L^{\le n} \subsetneq (L')^{\le n}$.

(b) If $L \not\subseteq L'$, then there is an integer $m \in N$ such that for
any $n \ge m$, $L^{\le n} \not\subseteq (L')^{\le n}$.

Lemma 4. Let $\mathcal{L}$ be a class with finite thickness.

For a nonempty language $L \subseteq \Sigma^*$, there is an integer $m \in N$
such that for any $n \ge m$ and for any $L' \in \mathcal{L}$ with
$L \subsetneq L'$, $L^{\le n} \subsetneq (L')^{\le n}$.

Proof. Let $L \subseteq \Sigma^*$ be a nonempty language and let us put
$\mathcal{L}' = \{ L' \in \mathcal{L} \mid L \subsetneq L' \}$. Since
$\mathcal{L}$ has finite thickness, we see that $\sharp\mathcal{L}'$ is finite.
Let us put $\mathcal{L}' = \{L'_1, \ldots, L'_t\}$.

For each $i$ with $1 \le i \le t$, since $L \subsetneq L'_i$, we see by Lemma 3
that there is an $m_i \in N$ such that for any $n \ge m_i$,
$L^{\le n} \subsetneq (L'_i)^{\le n}$. Put $m = \max\{m_1, \ldots, m_t\}$.

Then the lemma holds. □

Theorem 12. Let $d$ be a distance and let $k \in N$.

If a class $\mathcal{L}$ has finite thickness, then $\mathcal{L}$ is
k-neighbor-minimally inferable w.r.t. $d$ from positive examples.
Procedure IIM M;
begin
    let $T_0 := \emptyset$ and let $n := 0$;
    repeat
        let $n := n + 1$;
        read the next example $w$;
        let $T_n := T_{n-1} \cup \{w\}$;
        let $I_n := \{(i,m) \mid 0 \le i \le n,\ 0 \le m \le k,\ T_n \subseteq L_i^{(d,m)}\}$;
        let $M_n := \{(i,m) \in I_n \mid \forall (i',m') \in I_n,\ (L_{i'}^{(d,m')})^{\le n} \not\subsetneq (L_i^{(d,m)})^{\le n}\}$;
        if $M_n \ne \emptyset$ then
            let $m_n := \min\{m \mid (i,m) \in M_n\}$ and let $i_n := \min\{i \mid (i,m_n) \in M_n\}$;
        else
            let $m_n := 0$ and let $i_n := n$;
        output $(i_n, m_n)$;
    forever;
end.

Fig. 1. An IIM which k-neighbor-minimally infers a class w.r.t. $d$ from positive
examples



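Proof. Let $\mathcal{L} = \{L_i\}_{i \in N}$ be a class with finite thickness. Then, by Theorem 7,
we see that $\mathcal{L}^{(d,\le k)}$ has finite thickness. Let us consider the procedure in
Figure 1.

Assume that we feed a positive presentation $\sigma$ of a nonempty language
$L_{\mathrm{target}} \subseteq \Sigma^*$ to the procedure.

Let $I_\infty = \{(i,m) \mid i \in N,\ 0 \le m \le k,\ L_{\mathrm{target}} \subseteq L_i^{(d,m)}\}$ and let
$M_\infty = \{(i,m) \in I_\infty \mid \forall (i',m') \in I_\infty,\ L_{i'}^{(d,m')} \not\subsetneq L_i^{(d,m)}\}$.
We note that $\mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)}) = \{L_i^{(d,m)} \mid (i,m) \in M_\infty\}$.

Since $\mathcal{L}^{(d,\le k)}$ has finite thickness, we see that there is a finite subset
$T_{\mathrm{core}}$ of $L_{\mathrm{target}}$ such that
$\mathrm{MIN}(T_{\mathrm{core}}, \mathcal{L}^{(d,\le k)}) = \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$.
Since $\sigma$ is a positive presentation and $T_{\mathrm{core}}$ is a finite subset of
$L_{\mathrm{target}}$, it follows that there is an $n_{\mathrm{core}} \in N$ such that for any
$n \ge n_{\mathrm{core}}$, $T_{\mathrm{core}} \subseteq T_n$.

Furthermore, since $\mathcal{L}^{(d,\le k)}$ has finite thickness,
$\mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$ consists of finitely many languages, and
let us put $\mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)}) = \{L'_1, \ldots, L'_t\}$.

For each $i$ with $1 \le i \le t$, $L'_i \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$, and
thus for any $j$ with $1 \le j \le t$, $L'_i \not\subsetneq L'_j$. Therefore, by Lemma 3, we can
take $n_{ij}$'s ($1 \le i, j \le t$) such that for any $n \ge n_{ij}$,
$L_i'^{\le n} \not\subsetneq L_j'^{\le n}$. Let us put $n_A = \max\{n_{ij} \mid 1 \le i, j \le t\}$.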

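Claim A: Let $L \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$. For any
$L' \in \mathcal{L}^{(d,\le k)}$, if $T_{\mathrm{core}} \subseteq L'$, then for any $n \ge n_A$,
$L'^{\le n} \not\subsetneq L^{\le n}$.

Proof of the Claim A. Let $L' \in \mathcal{L}^{(d,\le k)}$ with $T_{\mathrm{core}} \subseteq L'$.

(I) In case of $L' \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$. The claim is clear
from the definition of $n_A$.

(II) In case of $L' \notin \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$. Since
$\mathcal{L}^{(d,\le k)}$ has finite thickness, there is an
$L'' \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$ such that $L'' \subseteq L'$. Thus,
for any $n \in N$, $L''^{\le n} \subseteq L'^{\le n}$. By the definition of $n_A$, for any
$n \ge n_A$, $L''^{\le n} \not\subsetneq L^{\le n}$. Therefore, for any $n \ge n_A$,
$L'^{\le n} \not\subsetneq L^{\le n}$.

By (I) and (II), the claim holds. ■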

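Claim B: For any $(i,m) \in M_\infty$, there exists an $n_B \in N$ such that for any
$n \ge n_B$, $(i,m) \in M_n$.

Proof of the Claim B. Let $(i,m) \in M_\infty$. Since
$L_i^{(d,m)} \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$, it follows that for any
$n \in N$, $T_n \subseteq L_{\mathrm{target}} \subseteq L_i^{(d,m)}$. Therefore, for any $n \ge i$,
$(i,m) \in I_n$.

For any $(i',m') \in I_n$ with $n \ge n_{\mathrm{core}}$, $T_{\mathrm{core}} \subseteq L_{i'}^{(d,m')}$, and
thus by Claim A we see that for any $n' \ge n_A$,
$(L_{i'}^{(d,m')})^{\le n'} \not\subsetneq (L_i^{(d,m)})^{\le n'}$.

Therefore, by putting $n_B = \max\{i, n_{\mathrm{core}}, n_A\}$, the claim holds. ■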

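Claim C: Let $M = \{(i,m) \mid i \in N,\ 0 \le m \le k\}$. There exists an $n_C \in N$
such that for any $n \ge n_C$ and for any $(i,m) \in M \setminus M_\infty$, $(i,m) \notin M_n$.

Proof of the Claim C. For each $n \in N$, we put
$\mathcal{L}_n = \{L_i^{(d,m)} \mid (i,m) \in M_n\}$. Let $(i,m) \in M \setminus M_\infty$.

(I) In case of $L_{\mathrm{target}} \not\subseteq L_i^{(d,m)}$. By the definition of
$T_{\mathrm{core}}$, we see that $T_{\mathrm{core}} \not\subseteq L_i^{(d,m)}$. Therefore, for any
$n \ge n_{\mathrm{core}}$, $T_n \not\subseteq L_i^{(d,m)}$, and thus $(i,m) \notin M_n$.

(II) In case of $L_{\mathrm{target}} \subseteq L_i^{(d,m)}$. By Claim B, we see that there is an
$n_{\min} \in N$ such that for any $n \ge n_{\min}$, $\{L'_1, \ldots, L'_t\} \subseteq \mathcal{L}_n$.
For each $j$ with $1 \le j \le t$, by Lemma 4, there is an $n_j \in N$ such that for any
$n \ge n_j$ and for any $L' \in \mathcal{L}^{(d,\le k)}$ with $L'_j \subsetneq L'$,
$L_j'^{\le n} \subsetneq L'^{\le n}$. Let us put $n_{\mathrm{incl}} = \max\{n_{\min}, n_1, \ldots, n_t\}$.

Since $(i,m) \notin M_\infty$ and $\mathcal{L}^{(d,\le k)}$ has finite thickness, there is an
$L' \in \mathrm{MIN}(L_{\mathrm{target}}, \mathcal{L}^{(d,\le k)})$ such that
$L_{\mathrm{target}} \subseteq L' \subsetneq L_i^{(d,m)}$. Thus, for any $n \ge n_{\mathrm{incl}}$,
there is an $L' \in \mathcal{L}_n$ such that $L'^{\le n} \subsetneq (L_i^{(d,m)})^{\le n}$, and thus
$(i,m) \notin M_n$.

Let $n_C = \max\{n_{\mathrm{core}}, n_{\mathrm{incl}}\}$. Then, by (I) and (II), the claim holds. ■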

Let $m_\infty = \min\{m \mid (i,m) \in M_\infty\}$ and let
$i_\infty = \min\{i \mid (i,m_\infty) \in M_\infty\}$. By Claims B and C, we see that the procedure
in Figure 1 converges to $(i_\infty, m_\infty)$ for $\sigma$. □

4 Pattern Languages and Their Union 

In the present section, we consider the class $\mathcal{PAT}$ of pattern languages
introduced by Angluin [1].

Let $X$ be a set of variable symbols. A pattern is a nonnull string of constant
symbols in $\Sigma$ and variable symbols in $X$. The pattern language $L(p)$ generated
by a pattern $p$ is the set of all constant strings obtained by substituting nonnull
constant strings for the variable symbols in $p$.

For example, let $\Sigma = \{a, b, c\}$ and let $X = \{x, y, \ldots\}$. Then $p = axbxc$ is
a pattern, and the set $\{aabac, abbbc, acbcc, aaabaac, aabbabc, \ldots\}$ is the pattern
language of $p$.
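Membership in a pattern language can be checked by a simple backtracking search over substitutions. The sketch below is only an illustration under stated assumptions: patterns and words are Python strings with one character per symbol, the function name `in_pattern_language` is hypothetical, and the pattern `axbxc` is taken as the pattern matching the strings listed above. It is exponential in the worst case and is not meant as an efficient algorithm.

```python
def in_pattern_language(pattern, word, variables):
    """Return True if `word` arises from `pattern` by substituting nonnull
    constant strings for the variables (same variable, same string)."""
    def match(pi, wi, binding):
        if pi == len(pattern):
            return wi == len(word)
        sym = pattern[pi]
        if sym not in variables:                       # constant symbol
            return wi < len(word) and word[wi] == sym and match(pi + 1, wi + 1, binding)
        if sym in binding:                             # variable already bound
            sub = binding[sym]
            return word.startswith(sub, wi) and match(pi + 1, wi + len(sub), binding)
        # Unbound variable: try every nonnull substring, leaving at least
        # one symbol of the word for each remaining pattern position.
        max_len = len(word) - wi - (len(pattern) - pi - 1)
        for length in range(1, max_len + 1):
            binding[sym] = word[wi:wi + length]
            if match(pi + 1, wi + length, binding):
                return True
            del binding[sym]
        return False
    return match(0, 0, {})

variables = {"x", "y"}
for w in ["aabac", "abbbc", "acbcc", "aaabaac", "aabbabc", "aabbc"]:
    print(w, in_pattern_language("axbxc", w, variables))
```

The first five words are accepted because both occurrences of `x` receive the same nonnull string; `aabbc` is rejected because the two occurrences would need different strings.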

Since two patterns that are identical except for renaming of variable symbols
generate the same pattern language, we do not distinguish one from the other.
The set of all patterns is recursively enumerable and whether $w \in L(p)$ or not
is recursively decidable for a constant string $w$ and for a pattern $p$. Therefore
we can consider the class $\mathcal{PAT}$ of pattern languages as an indexed family of
recursive languages, where the pattern itself is considered as an index.

For a constant string $w \in \Sigma^*$ and for a pattern $p$, if $w \in L(p)$, then $p$ is not
longer than $w$. Since the number of patterns shorter than a fixed length is finite,
the class $\mathcal{PAT}$ has finite thickness (cf. Angluin [1]). Therefore, by Theorems 3
and 12, we have the following theorems:

Theorem 13 (Angluin [1]). The class $\mathcal{PAT}$ is inferable in the limit from
positive examples.

Theorem 14. Let $d$ be a distance and let $k \in N$.

The class $\mathcal{PAT}$ is k-neighbor-minimally inferable w.r.t. $d$ from positive
examples.

Furthermore, Wright [16] showed that the class $\mathcal{PAT}^{[\le n]}$ has finite elasticity.
Therefore, by Theorems 2 and 11, we have the following theorems:

Theorem 15 (Wright [16]). Let $n \in N$.

The class $\mathcal{PAT}^{[\le n]}$ is inferable in the limit from positive examples.

Theorem 16. Let $d$ be a distance, let $k \in N$ and let $n \in N$.

The class $\mathcal{PAT}^{[\le n]}$ is weak k-neighbor-minimally inferable w.r.t. $d$ from
positive examples.

5 Concluding Remarks 

We have introduced a notion of a recursively generable distance and defined
a k-neighbor closure of a language by taking incorrectness into consideration.
Then we have formalized k-neighbor-minimal inferability and given some
sufficient conditions.

We have shown that a class with M-finite thickness and finite elasticity is
weak k-neighbor-minimally inferable from positive examples (Theorem 11). It is
known that there are many rich classes that have M-finite thickness and finite
elasticity (cf. Sato [14] and Mukouchi [11]). On the other hand, we could show
that a class with finite thickness is k-neighbor-minimally inferable from
positive examples (Theorem 12). The difficulty comes from the necessity of
eliminating, in finite time, the possibility that infinitely many non-minimal
languages may behave as if they were minimal languages (cf. Claim C in the proof
of Theorem 12).

As future work, we can consider refutable inference from complete examples,
some of which are incorrect. As another future investigation, it would be
worthwhile to develop an efficient learning algorithm for, e.g., the class of
regular pattern languages in our framework.
Abstract. Recent space plasma observations provide three-dimensional
velocity distributions that are found to have multiple peaks. We propose
a method for analyzing such a velocity distribution via a multivariate
Maxwellian mixture model in which each component represents one of
the multiple peaks. The parameters of the model are determined through
the EM algorithm. To judge automatically the preferable number of
components of the mixture model, we introduce a method of examining the
number of extrema of the resulting mixture model. We show applications
of our method to observations in the plasma sheet boundary layer
and in the central plasma sheet in the terrestrial magnetosphere.



1 Introduction 

From direct satellite observations of space plasma, we have obtained macroscopic
physical quantities (e.g., number density, bulk velocity and temperature) by
calculating velocity moments of particle velocity distribution functions. This
fluid description assumes that the plasma is in a state of local thermal
equilibrium. Under that assumption, the distribution function of particle
velocity is a normal distribution, which is called the Maxwellian distribution
in plasma physics. The (multivariate) Maxwellian distribution is given by

$$g(v \mid V, T) = \left(\frac{m}{2\pi}\right)^{3/2} \frac{1}{\sqrt{|T|}} \exp\left[-\frac{m}{2} (v - V)^{T} T^{-1} (v - V)\right], \tag{1}$$

where $m$ is the mass of the particle, $V$ is the bulk velocity vector, $T$ is the
temperature matrix, and superscript $T$ denotes transpose.

Observational techniques have progressed notably, and have made it possible to
measure the detailed shape of distribution functions in three-dimensional
velocity space. Those observations revealed that there are many cases where
space plasmas are not in thermal equilibrium. Their velocity distributions are
not a single Maxwellian but consist of multiple peaks. This is because space
plasma such as the solar wind is basically collisionless, with a large mean free
path (about 1 Astronomical Unit: the distance between the Sun and the Earth).
Therefore, we have to be careful, because distribution functions with different
shapes may give the same velocity moments. For instance, when a plasma has two
beam components whose velocity vectors are sunward and anti-sunward,
respectively, and the numbers of particles of the two components are the same,
the calculated bulk velocity becomes zero. On the other hand, when a stagnant
plasma is observed, the bulk velocity also becomes zero. When we deal with
the two-beam distribution, we should separate it into two beams and calculate
the velocity moments for each beam. It has been difficult, however, to evaluate
the shape of a distribution function, especially when more than one component
partially overlap each other. Furthermore, this produces a serious problem when
we treat many multi-component cases statistically.
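The ambiguity of velocity moments can be illustrated with a toy one-dimensional calculation. The sketch below is an assumption-laden illustration and not the authors' procedure: the helper `velocity_moments` and the numerical values (in km/s) are hypothetical. It shows that two equal counter-streaming beams and a stagnant distribution can share the same bulk velocity and the same velocity spread.

```python
import math

def velocity_moments(v, f):
    """Bulk velocity and velocity variance of a discrete 1-D distribution."""
    total = sum(f)
    bulk = sum(fk * vk for fk, vk in zip(f, v)) / total
    var = sum(fk * (vk - bulk) ** 2 for fk, vk in zip(f, v)) / total
    return bulk, var

# Two equal beams, sunward and anti-sunward (illustrative values in km/s).
beams = velocity_moments([-500.0, 500.0], [0.5, 0.5])

# A stagnant distribution centred on zero, chosen to have the same spread.
a = 500.0 * math.sqrt(2.0)
stagnant = velocity_moments([-a, 0.0, a], [0.25, 0.5, 0.25])

print(beams, stagnant)
```

Both cases return a bulk velocity of zero and nearly the same variance, even though the shapes are entirely different, which is why the moments alone cannot distinguish them.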

In this paper, we construct a method of representing a three-dimensional
distribution function by a multivariate Maxwellian mixture model in which the
parameter values are obtained by the EM (Expectation-Maximization) algorithm
[2] [4]. With this method, we can express the shape of the function and find a
possible way to conduct a statistical analysis. This paper is organized as
follows. In Sect. 2, we describe the data of plasma velocity distribution
functions. A fitting method with the multivariate Maxwellian mixture model is
presented in Sect. 3, followed in Sect. 4 by a consideration of how to determine
the preferable number of components in the mixture model. Two applications are
demonstrated in Sect. 5. In Sect. 6, we discuss a problem of model selection.



2 Data 

2.1 Instrumentation 

We used the data obtained by the Low Energy Particle Energy-per-charge
Analyzer (LEP-EA) onboard the Geotail spacecraft. LEP-EA consists of two nested
sets of quadrispherical electrostatic analyzers to measure three-dimensional
velocity distributions of electrons (with EA-e) and ions (with EA-i)
simultaneously and separately. In the present observation, EA-i covers the
energy range from 32 eV/q to 39 keV/q divided into 32 bins, of which 24 bins are
equally spaced on a logarithmic scale at energies higher than 630 eV/q and have
a width of ±9.4 % of the center energy, while the 8 lower-energy bins are spaced
linearly with a width of ±40 eV/q (±20 eV/q for the lowest energy bin). The full
energy range is swept in 1/16 of the spacecraft spin period (about 3 seconds).
The field of view is fan-shaped, about 10° × 145°, in which the longer dimension
is perpendicular to the spin plane and divided into seven directions centered at
elevation angles of 0°, ±22.5°, ±45° and ±67.5°, each with a width of 6-10°
(wider for higher elevation angles). Thus, count rate data of dimension
32 (energy bin) × 7 (elevation angle) × 16 (azimuthal sector) are generated in
one spin period. These classes in the velocity space are shown in Fig. 1.

The complete three-dimensional velocity distributions are obtained in a
period of four spins (about 12 seconds) owing to the telemetry constraints; the
count data are accumulated during the four-spin period. A more detailed
description of LEP instrumentation is given in [3].

Fig. 1. Classes for observations of ion velocity distributions with LEP-EAi

Assume that an electrostatic analyzer detected the particle count $C(v_{pqr})$ [#]
in a sampling time $\tau$ [s], where $v_{pqr}$ [m/s] is the particle velocity.
Subscripts $p$, $q$ and $r$ are integers and denote the position in the
three-dimensional velocity space. LEP-EA, for instance, has
$p = 1, 2, \ldots, 32$; $q = 1, 2, \ldots, 7$; $r = 1, 2, \ldots, 16$,
where we choose $p$, $q$ and $r$ as indicators of energy bin, elevation angle and
azimuthal sector, respectively. Thus we obtain the total number of the particle
count $N$ [#]:

$$N = \sum_{p,q,r} C(v_{pqr}). \tag{2}$$



2.2 Density Function 



Under the assumption that the incident differential particle flux is uniform within
the energy and angular responses of the analyzer, the velocity distribution
function $f_0(v_{pqr})$ [s$^3$/m$^6$] is given by



/O (Vpqr) = 2 X 10" 



1 C {Vpq^') 

TEQ (q,T q, ’ 

y^pqr^pqr) 



( 3 ) 



where $m$ [kg] and $q$ [C] are the mass and the charge of the particle, $\varepsilon$ is the
detection efficiency, $G$ [cm$^2$ sr eV/eV] is the geometrical factor and superscript
$T$ denotes transpose. LEP-EA has $\varepsilon$ and $G$ as functions of elevation angle:
$\varepsilon = \varepsilon_q$, $G = G_q$. Integrating $f_0(v_{pqr})$ over the velocity space,
we obtain the number density $n$ [#/m$^3$]:

$$n = \sum_{p,q,r} f_0(v_{pqr})\, dv_{pqr}. \tag{4}$$

In this paper, we treat the probability function:

$$f(v_{pqr}) = \frac{f_0(v_{pqr})\, dv_{pqr}}{n}, \tag{5}$$

so that

$$\sum_{p,q,r} f(v_{pqr}) = 1. \tag{6}$$

Since the observed distribution is a function of discrete variables, it would be
necessary to consider a mixture of probability functions, but we approximate it
by a mixture of Maxwellians, each of which is a probability density function.



3 Maxwellian Mixture Model 



We will fit the probability function (5) by the mixture model composed of the
sum of $s$ multivariate Maxwellian distributions:

$$f(v_{pqr}) \approx \sum_{i=1}^{s} n_i\, g_i(v_{pqr} \mid V_i, T_i), \tag{7}$$

where $n_i$ is the mixing proportion of the Maxwellians
($\sum_{i=1}^{s} n_i = 1$, $0 < n_i < 1$). Each Maxwellian $g_i$ is written as

$$g_i(v_{pqr} \mid V_i, T_i) = \left(\frac{m}{2\pi}\right)^{3/2} \frac{1}{\sqrt{|T_i|}} \exp\left[-\frac{m}{2} (v_{pqr} - V_i)^{T} T_i^{-1} (v_{pqr} - V_i)\right], \tag{8}$$

where $V_i$ [m/s] is the bulk velocity vector and $T_i$ [J] is the temperature matrix
of the $i$-th Maxwellian.

On the condition above, we consider the log-likelihood:

$$l(\theta) = N \sum_{p,q,r} f(v_{pqr}) \log \sum_{i=1}^{s} n_i\, g_i(v_{pqr} \mid V_i, T_i) \tag{9}$$

in order to obtain the maximum likelihood estimator of each parameter, where

$$\theta = (n_1, n_2, \ldots, n_{s-1}, V_1, V_2, \ldots, V_s, T_1, T_2, \ldots, T_s) \tag{10}$$

denotes all the unknown parameters.

Partially differentiating (9) with respect to $V_i$, $T_i^{-1}$ ($i = 1, 2, \ldots, s$) and
setting the derivatives equal to zero, we obtain the maximum likelihood
estimators (denoted by a hat) of the mixing proportion, the bulk velocity vector
and the temperature matrix of each Maxwellian:

$$\hat{n}_i = \sum_{p,q,r} f(v_{pqr})\, \hat{P}_i(v_{pqr}), \tag{11}$$

$$\hat{V}_i = \frac{1}{\hat{n}_i} \sum_{p,q,r} f(v_{pqr})\, \hat{P}_i(v_{pqr})\, v_{pqr}, \tag{12}$$

$$\hat{T}_i = \frac{1}{\hat{n}_i} \sum_{p,q,r} f(v_{pqr})\, \hat{P}_i(v_{pqr})\, m\, (v_{pqr} - \hat{V}_i)(v_{pqr} - \hat{V}_i)^{T}, \tag{13}$$

where

$$\hat{P}_i(v_{pqr}) = \frac{\hat{n}_i\, g_i(v_{pqr} \mid \hat{V}_i, \hat{T}_i)}{\sum_{j=1}^{s} \hat{n}_j\, g_j(v_{pqr} \mid \hat{V}_j, \hat{T}_j)} \tag{14}$$

is an estimated posterior probability.

On the basis of these equations, we will estimate the unknown parameters
by the EM algorithm [2] [4]. In the following procedure, $t$ denotes an iteration
counter of the EM algorithm. Suppose that superscript $(t)$ denotes the current
values of the parameters after $t$ cycles of the algorithm for $t = 0, 1, 2, \ldots$.



Setting Initial Value: t = 0. Set the initial values of the parameters of each
Maxwellian and posterior probability. At first, we classify the data in $s$ groups
($G_i$; $i = 1, 2, \ldots, s$) using the k-means algorithm, and set the initial value of
the posterior probability:

$$\hat{P}_i^{(0)}(v_{pqr}) = \begin{cases} 1 & (v_{pqr} \in G_i) \\ 0 & (v_{pqr} \notin G_i) \end{cases} \tag{15}$$

where $i = 1, 2, \ldots, s$. From the initial value of the posterior probability, we
calculate $\hat{n}_i^{(0)}$, $\hat{V}_i^{(0)}$ and $\hat{T}_i^{(0)}$ by (11), (12) and (13).

Parameter Estimation by the EM Algorithm: t ≥ 1.

E-step (Expectation step). At the E-step of the $t$-th iteration ($t \ge 1$), we compute
the posterior probability $\hat{P}_i^{(t)}$ by Eq. (14). At the same time, we estimate the
mixing proportion $\hat{n}_i^{(t)}$ by Eq. (11).

M-step (Maximization step). At the M-step of the $t$-th iteration, we choose the
values of the bulk velocity vector and temperature matrix as maximum likelihood
estimators by Eqs. (12) and (13).

Judgment of Convergence. We finish the iteration if the following convergence
condition is satisfied:

$$\left| l(\hat{\theta}^{(t)}) - l(\hat{\theta}^{(t-1)}) \right| < \epsilon \quad \text{and} \quad \left\| \hat{\theta}^{(t)} - \hat{\theta}^{(t-1)} \right\| < \delta, \tag{16}$$

where $\epsilon$ and $\delta$ are sufficiently small positive numbers. If the above condition
is not satisfied, return to the E-step, replacing $t$ by $t + 1$.
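The iteration above can be sketched in a few lines for a one-dimensional analogue. The following Python code is a minimal illustration under several assumptions that are not in the paper: it works in one dimension with unit particle mass (so the temperature reduces to a variance), it replaces the k-means initialization by equal-probability slices of the sorted bins, it uses only the log-likelihood part of the convergence test, and the names `gauss` and `em_mixture_1d` as well as the synthetic data are hypothetical.

```python
import math

def gauss(v, mean, var):
    """One-dimensional normal density (1-D stand-in for a Maxwellian, unit mass)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_mixture_1d(v, f, s=2, eps=1e-12, max_iter=2000):
    """Fit an s-component mixture to bin centres v with probabilities f (sum 1)."""
    K = len(v)
    # Initial posterior: hard 0/1 classification as in Eq. (15). Equal-mass
    # slices of the sorted bins are an assumption standing in for k-means.
    post = [[0.0] * K for _ in range(s)]
    cum = 0.0
    for k in range(K):
        post[min(s - 1, int(s * (cum + 0.5 * f[k])))][k] = 1.0
        cum += f[k]
    prev = None
    for _ in range(max_iter):
        # Weighted estimators: 1-D analogues of Eqs. (11)-(13).
        n = [sum(f[k] * post[i][k] for k in range(K)) for i in range(s)]
        V = [sum(f[k] * post[i][k] * v[k] for k in range(K)) / n[i] for i in range(s)]
        T = [max(sum(f[k] * post[i][k] * (v[k] - V[i]) ** 2 for k in range(K)) / n[i],
                 1e-12) for i in range(s)]
        # Posterior probabilities, Eq. (14), and log-likelihood per count, Eq. (9)/N.
        dens = [[n[i] * gauss(v[k], V[i], T[i]) for k in range(K)] for i in range(s)]
        tot = [max(sum(dens[i][k] for i in range(s)), 1e-300) for k in range(K)]
        loglik = sum(f[k] * math.log(tot[k]) for k in range(K))
        post = [[dens[i][k] / tot[k] for k in range(K)] for i in range(s)]
        if prev is not None and abs(loglik - prev) < eps:
            break
        prev = loglik
    return n, V, T

# Synthetic binned "observation": two well-separated components (hypothetical).
v = [-8.0 + 0.05 * k for k in range(321)]
raw = [0.5 * gauss(x, -2.0, 1.0) + 0.5 * gauss(x, 3.0, 0.25) for x in v]
norm = sum(raw)
f = [r / norm for r in raw]
n_hat, V_hat, T_hat = em_mixture_1d(v, f)
print(n_hat, V_hat, T_hat)
```

Because the data enter only through the bin probabilities $f$, the same weighted sums apply whether the input is raw counts or a normalized distribution; the three-dimensional case would replace the scalar variance by the temperature matrix of Eq. (13).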
4 Preferable Number of Components

When we fit a distribution function by the Maxwellian mixture model, we should
examine whether this fitting is reasonable. Fig. 2 shows two examples of applying
a single Maxwellian and a two-Maxwellian mixture model to different types of
observed distribution functions. In Fig. 2(a), the upper plot shows the observed
bimodal distribution, and the lower two plots show the fitting result of a single
Maxwellian on the left and that of a two-Maxwellian mixture on the right. We
find that the two-Maxwellian fitting represents the bimodal observation well in
this example. Fig. 2(b) shows an example of a unimodal observation and the
fitting results; we should not accept the two-Maxwellian fitting of this
observation.

Fig. 2. Examples of applying the single Maxwellian and the two-Maxwellian mixture
model fitting to different types of observations: (a) bimodal distribution and (b)
unimodal distribution

Here, we introduce a method of judging which of the models, i.e., a
single-component model or a two-component mixture model, is preferable for each
observation. We adopt the following principle. If the resulting two-component
mixture model has two peaks, the observation would also have two peaks. Hence,
we conclude that the two-component mixture model is reasonable to use. On the
other hand, if the resulting two-component mixture model has only one peak, the
observation would be a unimodal distribution, and we should use the usual single
Maxwellian fitting.



4.1 Diagnostics of Fitting 

To judge whether the fitting result is reasonable, therefore, we count the number
of peaks of the resulting fitting model. Furthermore, to count the number of
peaks, we count the number of extrema of the model. Let us consider the case
where we fit some data $f(v)$ by a two-Maxwellian mixture model and the fitting
result is computed as

$$f(v) \approx n_1 g_1(v \mid V_1, T_1) + n_2 g_2(v \mid V_2, T_2). \tag{17}$$

To count the number of peaks, we need to count the number of $v$ that satisfy

$$\frac{d}{dv} \left( n_1 g_1(v \mid V_1, T_1) + n_2 g_2(v \mid V_2, T_2) \right) = 0. \tag{18}$$

It is difficult, however, to treat the three-dimensional variable $v$. In the rest
of this section, we derive one set of simultaneous equations of one-dimensional
variables, whose number of solutions is equal to the number of $v$ that
satisfy Eq. (18).

Without loss of generality, we can shift the origin of the velocity space such
that the bulk velocity of component 1 ($V_1$) vanishes:

$$f(v') \approx n_1 g_1(v' \mid 0, T_1) + n_2 g_2(v' \mid V', T_2), \tag{19}$$

where we put $v' = v - V_1$, $V' = V_2 - V_1$. Since we are interested in the
topological form of the function, we can change the scale and rotate the main
axes such that the temperature matrix becomes isotropic. That is, since $T_1$ is a
symmetric matrix, we can define a new matrix $L$ that satisfies $L L^{T} = T_1^{-1}$. As
an expression of $L$, we choose

$$L = \left( \sqrt{\lambda_1}\, x_1 \;\; \sqrt{\lambda_2}\, x_2 \;\; \sqrt{\lambda_3}\, x_3 \right), \tag{20}$$

where $\lambda_1, \lambda_2, \lambda_3$ are the eigenvalues of $T_1^{-1}$, and $x_1, x_2, x_3$ are the
corresponding eigenvectors whose absolute values are unity. Transforming the
coordinates with $L^{T}$, that is, introducing new vectors $u$ and $U$ that satisfy
$u = L^{T} v'$ and $U = L^{T} V'$, we can express the fitting function (19) as

$$f(u) \approx \frac{1}{\sqrt{|T_1|}} \left[ n_1 g_1(u \mid 0, I) + n_2 g_2(u \mid U, L^{T} T_2 L) \right], \tag{21}$$

where $I$ is the unit matrix.

Since $T_2$ is a symmetric matrix, $(L^{T} T_2 L)$ is also symmetric, so that we
can carry out the orthogonal transformation such that $(L^{T} T_2 L)^{-1}$ becomes a
diagonal matrix $M$. That is, when we put $\mu_1, \mu_2, \mu_3$ as the eigenvalues of
$(L^{T} T_2 L)^{-1}$ and $y_1, y_2, y_3$ as the corresponding eigenvectors, and rotate the
main axes such that $u = (y_1\; y_2\; y_3)\, w$, then we obtain

$$f(w) \approx \frac{1}{\sqrt{|T_1|}} \left[ n_1 g_1(w \mid 0, I) + n_2 g_2(w \mid W, M^{-1}) \right], \tag{22}$$

where

$$M = \begin{pmatrix} \mu_1 & & O \\ & \mu_2 & \\ O & & \mu_3 \end{pmatrix} \tag{23}$$

and $W = (y_1\; y_2\; y_3)^{T} U$.

To obtain the (transformed) velocity $w$ that gives an extremum, we
differentiate Eq. (22) with respect to $w$. We then obtain the extremum condition
for $w$:

$$w = \left[ n_1 g_1(w \mid 0, I)\, I + n_2 g_2(w \mid W, M^{-1})\, M \right]^{-1} n_2 g_2(w \mid W, M^{-1})\, M W, \tag{24}$$

that is,

$$w_\alpha = \frac{\mu_\alpha W_\alpha}{\dfrac{n_1 g_1(w \mid 0, I)}{n_2 g_2(w \mid W, M^{-1})} + \mu_\alpha}, \tag{25}$$

where $\alpha = 1, 2, 3$.

To examine Eq. (25), we introduce the parameter $\xi$ defined by

$$\xi = \frac{n_1 g_1(w \mid 0, I)}{n_2 g_2(w \mid W, M^{-1})}, \tag{26}$$

so that

$$w_\alpha(\xi) = \frac{\mu_\alpha W_\alpha}{\xi + \mu_\alpha}. \tag{27}$$

Substituting Eq. (27) into Eq. (26), we obtain an equation with respect to $\xi$:

$$\xi = \frac{n_1 g_1(w(\xi) \mid 0, I)}{n_2 g_2(w(\xi) \mid W, M^{-1})}. \tag{28}$$

A solution of Eq. (28) is given by an intersection of the line

$$\eta(\xi) = \xi \tag{29}$$

and the curve

$$\eta(\xi) = \frac{n_1 g_1(w(\xi) \mid 0, I)}{n_2 g_2(w(\xi) \mid W, M^{-1})} \tag{30}$$

in the $\xi$-$\eta$ plane. In this way, the formerly three-dimensional problem has been
reduced to a one-dimensional one, and it becomes easier to examine the number of
extrema.
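The reduction can be tried out in a one-dimensional analogue. The sketch below is an illustration only, assuming unit particle mass, a first component centred at zero with unit variance and a second component centred at $W$ with variance $1/\mu$; the function names and the grid-based sign-change counting are assumptions of this sketch and are not taken from the paper. It counts the intersections of the line $\eta = \xi$ with the curve of Eq. (30) and cross-checks the result against a direct count of the zeros of the derivative of the mixture.

```python
import math

def count_sign_changes(values):
    """Number of sign changes in a sequence, ignoring exact zeros."""
    signs = [x > 0 for x in values if x != 0.0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def log_ratio(w, n1, n2, W, mu):
    """log of n1*g1(w|0,1) / (n2*g2(w|W,1/mu)) for unit mass in one dimension."""
    return (math.log(n1 / n2) - 0.5 * math.log(mu)
            - 0.5 * w * w + 0.5 * mu * (w - W) ** 2)

def count_extrema_xi(n1, n2, W, mu, points=3999):
    """Count intersections of eta = xi with the 1-D analogue of Eq. (30)."""
    h = []
    for j in range(points):
        xi = 10.0 ** (-12.0 + 24.0 * j / (points - 1))   # logarithmic grid in xi
        w = mu * W / (xi + mu)                           # 1-D analogue of Eq. (27)
        h.append(math.exp(log_ratio(w, n1, n2, W, mu)) - xi)
    return count_sign_changes(h)

def count_extrema_direct(n1, n2, W, mu, points=12001):
    """Cross-check: sign changes of the derivative of the mixture on a grid."""
    lo, hi = min(0.0, W) - 1.0, max(0.0, W) + 1.0
    d = []
    for j in range(points):
        w = lo + (hi - lo) * j / (points - 1)
        g1 = math.exp(-0.5 * w * w)
        g2 = math.sqrt(mu) * math.exp(-0.5 * mu * (w - W) ** 2)
        d.append(-n1 * g1 * w - n2 * g2 * mu * (w - W))
    return count_sign_changes(d)

print(count_extrema_xi(0.5, 0.5, 4.0, 1.0), count_extrema_direct(0.5, 0.5, 4.0, 1.0))
print(count_extrema_xi(0.5, 0.5, 1.0, 1.0), count_extrema_direct(0.5, 0.5, 1.0, 1.0))
```

With well-separated components ($W = 4$) both counts give three extrema, and with strongly overlapping components ($W = 1$) both give one, which mirrors the bimodal and unimodal cases discussed in the text.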
4.2 Simulation Study

We determine which of the fitting results is preferable by checking the resulting
two-mixture fitting. The left-hand panel of Fig. 3(a) shows the fitting result
represented by the sum of the two components obtained for the bimodal
observation shown in Fig. 2(a). This result has three extrema (two maxima and
one saddle point between them), denoted by three dots. We plot each relation
of the simultaneous equations (29) and (30) in the $\xi$-$\eta$ plane in the right-hand
panel of Fig. 3(a). In practical data processing, we first find the three
intersections of these two graphs, whose number is equal to the number of
extrema of the resulting two-component mixture model, and then judge that the
two-mixture model is preferable to a single-component model.

Similarly, the left-hand panel of Fig. 3(b) shows the resulting two-mixture
fitting for the unimodal observation shown in Fig. 2(b), and the right-hand panel
shows the relation between $\xi$ and $\eta$ based on the simultaneous equations (29)
and (30). Unlike case (a), the simultaneous equations have only one solution,
and the two-mixture model accordingly has only one extremum. We adopt,
therefore, the single Maxwellian model, which represents the usual velocity
moments.



Fig. 3. The number of extrema of the fitted function of the two-Maxwellian mixture
distribution for different types of observations: (a) bimodal distribution and (b)
unimodal distribution. In both (a) and (b), the left-hand plot shows the two-mixture
distribution whose extrema are denoted by the dots, and the right-hand plot shows
the line (Eq. (29)) and the curve (Eq. (30)) whose intersections are also denoted by
the dots
5 Application

Now we apply the fitting method explained above to observations of the ion
distribution function. We first fit the data by the mixture distribution with
the number of components fixed at two, then examine the result, and judge
whether the fitting by two components is reasonable or a single component fits
the observation better. In the rest of this section, we apply our method to two
different kinds of observation: one is an example in which the fitting by two
components is preferable, and the other is one in which the one-component
fitting is better.



Fig. 4. (a) Observation of the ion velocity distribution function in the $v_x$-$v_y$
plane in the time interval of 1554:58-1555:10 UT on January 14, 1994 (left),
fitting function by the single Maxwellian (s = 1, center) and fitting function by
the two-Maxwellian mixture model (s = 2, right). (b) The same for the time
interval of 1558:58-1559:10 UT on the same day. Both axes are in km/s and range
from -2000 to 2000



First, let us apply the method to the ion velocity distribution in the time
interval 1554:58-1555:10 on January 14, 1994. The left-hand panel of Fig. 4(a)
shows the observation obtained in the plasma sheet boundary layer. We show the
distribution function sliced by the $v_x$-$v_y$ plane, whose value is
black-to-white-coded according to the bar at the right of the panel. In the
$v_x$-$v_y$ slice, we find a hot component and a cold component whose bulk velocities
are about $(v_x, v_y) = (1000$ km/s, $0$ km/s$)$ and $(v_x, v_y) = (-200$ km/s, $-500$ km/s$)$,
respectively.

When we fit these data with a single Maxwellian, we obtain the estimated
parameters shown in the second row of Table 1(a). With these parameters, we
can produce the distribution function shown in the central panel of Fig. 4(a).
This corresponds to what we deal with by the usual velocity moments, and it is
hot and has a shifted bulk velocity compared with the observation.



Table 1. Estimated parameters for single Maxwellian and for two-Maxwellian mixture
model: (a) in the time interval 1554:58-1555:10 on January 14, 1994, and (b) in the
time interval 1558:58-1559:10 on the same day

(a) 1554:58-1555:10 UT on January 14, 1994

     n [/cc]   Vx [km/s]     Vy     Vz   Txx [eV]    Txy    Txz    Tyy    Tyz    Tzz
1    0.038        522      -189     41     5560      518    -86   1951    -29   1074
1    0.018       1173       -66     20     2914     -456     79   3094     76   2100
2    0.019        -84      -304     60       54      -81     14    602    -80    109

(b) 1558:58-1559:10 UT on January 14, 1994

     n [/cc]   Vx [km/s]     Vy     Vz   Txx [eV]    Txy    Txz    Tyy    Tyz    Tzz
1    0.065        139      -312    -87     3859       13    -63   3467    -42   3721
1    0.024         84       -99    -22     6117      116   -178   5183    556   4620
2    0.041        171      -437   -126     2498       66     39   2011   -532   3150

This problem, however, is easily solved by applying a two-component mixture
model. We give the estimated parameters in the third and fourth rows of
Table 1(a), and display the calculated distribution function in the right-hand
panel of Fig. 4(a). We found that the hot and cold components seen in the
observed distribution were reproduced adequately.

For this example, we found that the fitting by two components is preferable
to that by a single component by counting the number of solutions of the
simultaneous equations (29) and (30). Fig. 5(a) shows the relation of $\xi$ and
$\eta$. The two graphs have three intersections, that is, the two-component mixture
model has two maxima and one minimum between them. Thus, the two-component
fitting is justified, which agrees with our inspection of the observed
distribution.

The other example is a distribution function observed in the time interval of
1558:58-1559:10 on the same day. These data were obtained in the central plasma
sheet. As can be seen in the left-hand panel of Fig. 4(b), it is appropriate to
think that this distribution consists of a single hot component whose bulk
velocity is located near the origin of the velocity space. Hence, when we fit
the data, we should adopt not the two-Maxwellian mixture but the single
Maxwellian model.

In the central panel of Fig. 4(b), we show the distribution function calculated
with the single-component model. The parameters used are shown in the second row
of Table 1(b). In this case, the single-component fitting appears to be
sufficient. Further, we display the result obtained with the two-component
mixture model. The right-hand panel shows the distribution function calculated
with the estimated parameters shown in the third and fourth rows of Table 1(b).
For this example, we think that the two-component fitting is an over-fitting.

In this case, when we examine the number of solutions of the simultaneous
equations for $\xi$ and $\eta$, we find that they have only one solution, as shown in
Fig. 5(b). Since the resulting two-component mixture model has no minimum in
this example, we adopt the usual velocity moments obtained by the
single-component fitting.

Fig. 5. Relation between $\xi$ and $\eta$ determined by the simultaneous equations (29)
and (30): (a) in the time interval of 1554:58-1555:10 and (b) in the time interval
of 1558:58-1559:10 on January 14, 1994



6 Discussion 

In the course of choosing the preferable number of components, we first fit the
data with a two-component mixture model, and then examine whether there is a
saddle point on the segment between the bulk velocities of the model.

When we select the preferable number of components, that is, when we compare
the goodness of the models, AIC (Akaike Information Criterion, [1]) is known to
be useful. If there are several candidate models, we can find the best one by
finding the model with the smallest value of AIC, defined by

$$\mathrm{AIC} = -2 \max l(\theta) + 2 \dim \theta. \tag{31}$$

In the present analysis, however, AIC has not given the expected criterion, as we
explain in the following.
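As a quick check of how Eq. (31) behaves on the present data, the following Python sketch recomputes the AIC differences relative to the single Maxwellian from the $\dim\theta$ and $\max l(\theta)$ values reported for case (a) in Table 2. The function name `aic` and the variable names are illustrative; only the numbers are taken from the table.

```python
def aic(max_loglik, dim_theta):
    """Akaike Information Criterion as defined in Eq. (31)."""
    return -2.0 * max_loglik + 2.0 * dim_theta

# (number of components, dim theta, max l(theta)) for case (a) of Table 2.
table_2a = [
    (1, 9, -72217.82),
    (2, 19, -68560.85),
    (3, 29, -68286.86),
    (4, 39, -67586.02),
    (5, 49, -67176.28),
    (6, 59, -67565.62),
]

baseline = aic(table_2a[0][2], table_2a[0][1])
relative_aic = {s: aic(loglik, dim) - baseline for s, dim, loglik in table_2a}
best = min(relative_aic, key=relative_aic.get)

for s in sorted(relative_aic):
    print(s, round(relative_aic[s], 2))
print("smallest AIC at", best, "components")
```

The recomputed differences agree with the tabulated AIC column to within rounding, and the change of about 10 000 in $-2 \max l(\theta)$ dwarfs the penalty term, which changes by only 20 per added component.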

When we apply the mixture models whose number of components ranges from 1 to 6
($s = 1, 2, \ldots, 6$) to the same data as in the previous section, the
corresponding $\dim\theta$, $\max l(\theta)$, AIC, and BIC (explained below) are as
summarized in Table 2. The baselines of AIC and BIC are taken to be those of the
single Maxwellian fittings. In case (a), AIC decreases with increasing number of
components and has a minimum at five components. This number, however, is thought
to be too large to fit the observation. On the other hand, in case (b), AIC has no
minimum in the range of 1 to 6 components, so we would have to prepare seven or
more components for the appropriate fitting. This number of components also seems
to be too large.

Table 2. Comparison of six Maxwellian mixture models. Asterisks denote the minimum
of AIC and BIC

(a) 1554:58-1555:10 UT on January 14, 1994: N = 3156

Comp.   dim θ    max l(θ)      AIC - AIC(1)    BIC - BIC(1)
1         9      -72217.82          0.00            0.00
2        19      -68560.85      -7293.94        -7273.65
3        29      -68286.86      -7821.91        -7781.33
4        39      -67586.02      -9203.61        -9142.75
5        49      -67176.28    * -10003.08     * -9921.94
6        59      -67565.62      -9204.41        -9102.98

(b) 1558:58-1559:10 UT on January 14, 1994: N = 4860

Comp.   dim θ    max l(θ)      AIC - AIC(1)    BIC - BIC(1)
1         9     -113789.51          0.00            0.00
2        19     -113605.13       -348.77         -326.32
3        29     -113302.66       -933.70         -888.82
4        39     -113232.97      -1053.09         -985.76
5        49     -113195.14      -1108.77        -1018.99
6        59     -112407.97    * -2663.08      * -2550.86

These results do not agree with our intuition. We think that there are three
reasons for this. First, the Maxwellian distribution is not an appropriate
component distribution for the mixture model, which is notable in the
high-energy range. Since the observation has a heavy tail in the high-energy
range, many components are needed to fit such a tail accurately.
One of the heavy-tailed distributions is the κ distribution, defined by



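$$g_i(v_{pqr} \mid V_i, T_i, \kappa_i) = \left(\frac{m}{2\pi}\right)^{3/2} \frac{\Gamma(\kappa_i)}{\Gamma(\kappa_i - 3/2)} \frac{1}{\sqrt{|T_i|}} \left[ 1 + \frac{m}{2} (v_{pqr} - V_i)^{T} T_i^{-1} (v_{pqr} - V_i) \right]^{-\kappa_i}. \qquad (32)$$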



This converges to the Maxwellian distribution in the limit $\kappa_i \to \infty$.
When we select the κ distribution as the component distribution (i.e., a κ
mixture model), the algorithm in Sect. 3 can work by including a step that
updates $\kappa_i$. On carrying out the calculation, however, we found that the
estimated $\kappa_i$ value is of the order of 10^, which means that the
distribution is practically Maxwellian; that is, we failed to lift up the tail
of the κ distribution. Since the κ distribution includes the Maxwellian, we
could treat the data in a more comprehensive way if the fitting by the κ
mixture model were successful. Our algorithm needs some improvement.

The next reason why AIC decreases as the number of components increases
is that the observation contains a large number of data (N = 3156 and 4860 for
examples (a) and (b), respectively). The log-likelihood is multiplied by the
data number N as defined in Eq. (9), and is of the order of $10^4$-$10^5$ in
the two examples. On the other hand, the dimension of the free parameters θ is
of the order of $10^0$-$10^1$. AIC is therefore determined practically by the
log-likelihood, and the dimension of the free parameters has little effect as
a penalty term.

We then evaluated the models by BIC instead of AIC. BIC is an information
criterion derived so that the posterior probability is maximized, and is
defined as [5]

$$\mathrm{BIC} = -2\, l(\hat{\theta}) + \frac{1}{2} \dim \theta \cdot \log N. \qquad (33)$$

With BIC, however, we obtained the same result as with AIC (see Table 2).
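The relative criteria in Table 2 can be recomputed from the tabulated maximum log-likelihoods. The following is a minimal sketch for case (a), not the authors' code. It assumes that dim θ = 10s - 1 for s components (which matches the dim θ column) and that the BIC penalty carries the factor 1/2, since that choice reproduces the BIC column of Table 2. The function names are illustrative only.

```python
import math

# Maximum log-likelihoods for case (a) of Table 2, keyed by number of components.
N = 3156
max_loglik = {1: -72217.82, 2: -68560.85, 3: -68286.86,
              4: -67586.02, 5: -67176.28, 6: -67565.62}


def dim_theta(s):
    # Assumption: 9 free parameters for one component and 10 more for each
    # additional component, as in the dim θ column of Table 2.
    return 10 * s - 1


def aic(loglik, dim):
    return -2.0 * loglik + 2.0 * dim


def bic(loglik, dim, n):
    # Assumption: penalty (1/2) * dim θ * log N, which reproduces Table 2.
    return -2.0 * loglik + 0.5 * dim * math.log(n)


def relative_criteria(logliks, n):
    """Return {s: (AIC - AIC(1), BIC - BIC(1))} for every model size s."""
    base_aic = aic(logliks[1], dim_theta(1))
    base_bic = bic(logliks[1], dim_theta(1), n)
    rows = {}
    for s, ll in sorted(logliks.items()):
        d = dim_theta(s)
        rows[s] = (aic(ll, d) - base_aic, bic(ll, d, n) - base_bic)
    return rows


rows = relative_criteria(max_loglik, N)
best_aic = min(rows, key=lambda s: rows[s][0])
best_bic = min(rows, key=lambda s: rows[s][1])
for s, (d_aic, d_bic) in rows.items():
    print(s, round(d_aic, 2), round(d_bic, 2))
```

Because the log-likelihood differences are in the thousands while the penalty differences are in the tens, both criteria select the same model here, which is the behavior described in the text.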

Finally, the decrease in AIC also depends on the higher one-count level near
the origin of the velocity space, which is seen as a 'hole' in the distribution
function near the origin. This is a property of electrostatic analyzers, and it
means that the density of the velocity distribution is poorly resolved near the
origin. For instance, let us assume that the ambient plasma distribution is a
simple Maxwellian with a single peak near the origin of the velocity space, and
that an electrostatic analyzer observes this plasma, that is, detects it as
counts. When the count equivalent to the ambient distribution (calculated by
Eq. (3)) is less than one, the observed count becomes zero, since the count
takes an integer value. This omission occurs especially near the origin of
the velocity space and produces an observed distribution that has no peak
near the origin but forms a hole like a caldera. The multiple components
that AIC requires therefore serve to represent the rim of the caldera. We
should reconsider the treatment of the density near the origin.

In this paper, we only examined the existence of a saddle point between the
two bulk velocities in the present case study, in which the observed
distribution has two peaks whose mutual distance is long enough compared with
each temperature matrix. This procedure is not valid, however, if the two bulk
velocities are close to each other, since the saddle point does not appear
between the two velocities. Such data are seen in electron distribution
functions (photo-electron and ambient-electron components) and also in the
solar wind plasma (core and halo components). For such data, we need another
criterion instead of the saddle point searching method. We leave this for
future study.
Abstract. This paper presents knowledge discovery from fMRI brain
images. The algorithm for the discovery is Logical Regression Analysis,
which consists of two steps. The first step is regression analysis. The
second step is rule extraction from the regression formula obtained by
the regression analysis. In this paper, we use nonparametric regression
analysis as the regression analysis, since sufficient data are not
available in knowledge discovery from fMRI brain images. The algorithm has
been applied to two experimental tasks, finger tapping and calculation.
Experimental results show that the algorithm has rediscovered well-known
facts and discovered new facts.



1 Introduction 

Analysis of brain functions using functional magnetic resonance imaging (fMRI),
positron emission tomography (PET), magnetoencephalography (MEG) and so
on is called non-invasive analysis of brain functions [4]. As a result of the
ongoing development of non-invasive analysis of brain functions, detailed
functional brain images can be obtained, from which the relations between
brain areas and brain functions can be understood, for example, the relation
between one subarea and another subarea in the motor area and a finger
movement.

Several brain areas are responsible for a brain function. Some of them are 
connected in series, and others are connected in parallel. Brain areas connected in 
series are described by “AND” and brain areas connected in parallel are described 
by “OR”. Therefore, the relations between brain areas and brain functions are 
described by rules. 

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 212-223, 2000. 
Researchers are trying to discover such rules heuristically from functional
brain images. Several statistical methods, for example, principal component
analysis, have been developed. However, the statistical methods can only
present some principal areas for a brain function. They cannot discover rules.
This paper presents an algorithm for the discovery of rules from fMRI brain
images.

fMRI brain images can be dealt with by supervised inductive learning. How- 
ever, the conventional inductive learning algorithms[5] do not work well for fMRI 
brain images, because there are strong correlations between attributes(pixels) 
and a small number of samples. 

There are two solutions to the above two problems. The first is the
modification of the conventional inductive learning algorithms. The other is
nonparametric regression. The modification of the conventional inductive
learning algorithms would require a lot of effort. On the other hand,
nonparametric regression has been developed for the above two problems. We
use nonparametric regression for knowledge discovery from fMRI brain images.
The outputs of nonparametric regression are linear formulas, which are not
rules. However, we have already developed a rule extraction algorithm for
regression formulas [9], [10], [12].

The algorithm for knowledge discovery from fMRI brain images consists of
two steps. The first step is nonparametric regression. The second step is rule
extraction from the linear formula obtained by the nonparametric regression.
The method is a Logical Regression Analysis (LRA), that is, a knowledge
discovery algorithm consisting of regression analysis^1 and rule extraction
from the regression formulas.

We applied the algorithm to artificial data and confirmed that the algorithm
works well for artificial data [11]. We have also applied the algorithm to real
fMRI brain images. The experimental tasks are finger tapping and calculation.
This paper reports that the algorithm works well for real fMRI data, has
rediscovered well-known facts regarding finger tapping and calculation, and
has discovered new facts regarding calculation.

Section 2 explains the knowledge discovery from fMRI images by Logical 
Regression Analysis. Section 3 describes the experiments. 

2 Knowledge discovery from fMRI brain images by 
Logical Regression Analysis 

2.1 The outline 

The brain is 3-dimensional. In fMRI brain images, a set of 2-dimensional
images (slices) represents a brain. See Fig. 1, in which 5 slices are obtained.
Fig. 2 shows a real fMRI brain image. When an image consists of 64 x 64
(= 4096) pixels, Fig. 2 can be represented as Fig. 3. In Fig. 3, white pixels
mean activations and black pixels mean inactivations. Each pixel has the value
of the activation.

^1 The regression analysis includes nonlinear regression analysis using neural
networks.

Fig. 3. An example of fMRI image



An experiment consists of several measurements. Fig. 4 shows a case in which
a subject repeats a task (for example, finger tapping) three times. "ON" in the
upper part of the figure means that the subject executes the task, and "OFF"
means that the subject does not execute the task, which is called rest. The
bars in the lower part of the figure represent measurements. The figure shows
24 measurements. When 24 images (samples) have been obtained, the data of a
slice can be represented as in Table 1.

Y (N) in the class column stands for on (off) of the experimental task. With
the data in the form of Table 1, machine learning algorithms can be applied to
fMRI brain images. In the case of Table 1, the attributes are continuous and
the class is discrete.

Table 1. fMRI Data

        1     2     ...   4096   class
S1      10    20    ...   11     Y
S2      21    16    ...   49     N
...     ...   ...   ...   ...    ...
S24     16    39    ...   98     Y

Attributes (pixels) in image data have strong correlations between adjacent
pixels. Moreover, it is very difficult or impossible to obtain sufficient fMRI
brain samples, and so there are few samples compared with the number of
attributes (pixels). Therefore, conventional supervised inductive learning
algorithms such as C4.5 [5] do not work well, which was confirmed in [11].

Nonparametric regression works well for strong correlations between attributes 
and a small number of samples. The rule extraction algorithm can be applied to 
the linear formula obtained by nonparametric regression. The algorithm for the 
discovery of rules from fMRI brain images consists of nonparametric regression 
and rule extraction. 

2.2 Nonparametric regression analysis 

First, for simplification, the 1-dimensional case is explained [1].
Nonparametric regression is as follows:

Let y stand for a dependent variable and t stand for an independent variable,
and let $t_j$ (j = 1, ..., m) stand for measured values of t. Then the
regression formula is as follows:

$$y = \sum_{j=1}^{m} a_j t_j + e,$$

where the $a_j$ are real numbers and e is a zero-mean random variable. When
there are n measured values of y,

$$y_i = \sum_{j=1}^{m} a_j t_{ij} + e_i \quad (i = 1, \ldots, n).$$

In conventional linear regression, the error is minimized, while in
nonparametric regression, the error plus a continuity or smoothness term is
minimized.

In fMRI brain images, there are continuities among the pixels, that is,
adjacent measured values of the independent variable have continuity in their
influence over the dependent variable [3]. Therefore, the continuity of the
coefficients $a_i$ is added to the evaluation value as follows:

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m-1} (a_{i+1} - a_i)^2.$$

When λ is fixed, the above formula is a function of the $a_i$ ($\hat{y}_i$ is
a function of the $a_i$). Therefore, the $a_i$ are determined by minimizing the
evaluation value, and the optimal value of λ is determined by cross-validation.
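The 1-dimensional evaluation value is quadratic in the coefficients, so for a fixed λ it can be minimized by solving one linear system. The following is a minimal sketch under that reading, not the authors' implementation: the names `solve`, `fit_smooth` and `roughness`, the random toy data, and the two λ values are all assumptions made for illustration, and the choice of λ by cross-validation is not shown.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_smooth(T, y, lam):
    """Minimize (1/n) sum_i (y_i - sum_j a_j T[i][j])^2 + lam * sum_j (a_{j+1} - a_j)^2."""
    n, m = len(T), len(T[0])
    # Normal equations: (T'T / n + lam * D'D) a = T'y / n,
    # where D is the first-difference operator on the coefficients.
    A = [[sum(T[i][j] * T[i][k] for i in range(n)) / n for k in range(m)]
         for j in range(m)]
    b = [sum(T[i][j] * y[i] for i in range(n)) / n for j in range(m)]
    for j in range(m - 1):
        A[j][j] += lam
        A[j + 1][j + 1] += lam
        A[j][j + 1] -= lam
        A[j + 1][j] -= lam
    return solve(A, b)


def roughness(a):
    """Continuity term: sum of squared differences of adjacent coefficients."""
    return sum((a[j + 1] - a[j]) ** 2 for j in range(len(a) - 1))


random.seed(0)
n, m = 6, 10  # fewer samples than attributes, as with fMRI data
T = [[random.random() for _ in range(m)] for _ in range(n)]
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # toy task (1) / rest (0) labels
a_small = fit_smooth(T, y, 1e-3)  # almost no smoothing
a_large = fit_smooth(T, y, 1e6)   # coefficients forced to be nearly constant
print(roughness(a_small), roughness(a_large))
```

The continuity term is what makes the system solvable even though there are fewer samples than coefficients. A larger λ trades fitting error for smoother coefficients.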

In 2-dimensional nonparametric regression, the evaluation value for the
continuity of the coefficients $a_{ij}$ is modified. In 1 dimension, there are
two adjacent measured values, while in 2 dimensions, there are four adjacent
measured values. The evaluation value for the continuity of the coefficients
is not

$$\sum_{i=1}^{m-1} (a_{i+1} - a_i)^2,$$

but the differences of first order between a pixel and the four adjacent
pixels in the image. For example, in the case of pixel 66 in Fig. 3, the
adjacent pixels are pixel 2, pixel 65, pixel 67, and pixel 130, and the
evaluation value is as follows:

$$(a_{66} - a_2)^2 + (a_{66} - a_{65})^2 + (a_{66} - a_{67})^2 + (a_{66} - a_{130})^2.$$
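The pixel-66 example can be checked with a short sketch of the four-neighbour continuity term. It assumes 1-based, row-major pixel numbering on a 64 x 64 image, which is consistent with the neighbours 2, 65, 67 and 130 named in the text. The function names, the optional `mask` argument (a set of brain pixels, so that no continuity is assumed across the brain boundary), and the decision to count each adjacent pair once are assumptions of this sketch.

```python
def neighbors(pixel, width=64, height=64, mask=None):
    """Return the 4-connected neighbours of a 1-based, row-major pixel index."""
    row, col = divmod(pixel - 1, width)
    result = []
    if row > 0:
        result.append(pixel - width)  # pixel above
    if col > 0:
        result.append(pixel - 1)      # pixel to the left
    if col < width - 1:
        result.append(pixel + 1)      # pixel to the right
    if row < height - 1:
        result.append(pixel + width)  # pixel below
    if mask is not None:
        # Keep only neighbours inside the brain region.
        result = [q for q in result if q in mask]
    return result


def continuity_penalty(a, width=64, height=64, mask=None):
    """Sum of squared first-order differences, each adjacent pair counted once.

    `a` maps a pixel index to its regression coefficient.
    """
    pixels = mask if mask is not None else range(1, width * height + 1)
    total = 0.0
    for p in pixels:
        for q in neighbors(p, width, height, mask):
            if q > p:
                total += (a[p] - a[q]) ** 2
    return total


print(neighbors(66))  # the four pixels adjacent to pixel 66

# A 2 x 2 toy image: coefficients differ only between the two columns.
toy = {1: 0.0, 2: 1.0, 3: 0.0, 4: 1.0}
print(continuity_penalty(toy, width=2, height=2))
```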



2.3 Applying nonparametric regression analysis to fMRI brain 
images 

When nonparametric regression analysis is applied to fMRI brain images, there
are a few problems. Nonparametric regression analysis should be applied only to
the pixels corresponding to the brain. Therefore, the areas corresponding to
the brain are extracted from the fMRI brain images. The extraction can be
executed by Statistical Parametric Mapping (SPM) [6], which is widely used in
brain science. SPM transforms brains to the standard brain [7].

There is no continuity between the inside and the outside of a brain. There- 
fore, the continuity of parameters in nonparametric regression analysis is not 
assumed at the boundary. 



2.4 Rule extraction 

The rule extraction algorithm in the discrete domain 

The basic algorithm approximates linear formulas by Boolean functions. Let
$(f_i)$ be the values of a linear formula, and let $(g_i)$ ($g_i$ = 0 or 1) be
the values of a Boolean function. The basic method is as follows:

$$g_i = \begin{cases} 1 & (f_i \ge 0.5), \\ 0 & (f_i < 0.5). \end{cases}$$

This method minimizes the Euclidean distance between the linear formula and
the Boolean function. The basic algorithm is exponential in computational
complexity, and therefore a polynomial algorithm is needed. The authors have
presented such a polynomial algorithm. The details can be found in [9], [10].
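The basic (exponential) method can be written down directly: evaluate the linear formula at every point of {0, 1}^n and threshold at 0.5. The following is a minimal sketch of that basic method only, not of the polynomial algorithm in [9], [10]. The function names and the example weights are hypothetical, and the threshold is taken as "at least 0.5".

```python
from itertools import product


def approximate_by_boolean(weights, bias):
    """Truth table of the Boolean function nearest to
    f(x) = bias + sum_i weights[i] * x[i] on {0,1}^n."""
    table = {}
    for x in product((0, 1), repeat=len(weights)):
        f = bias + sum(w * xi for w, xi in zip(weights, x))
        # Rounding each value to the nearer of 0 and 1 minimizes the
        # Euclidean distance between (f_i) and (g_i).
        table[x] = 1 if f >= 0.5 else 0
    return table


def to_dnf(table, names):
    """Write the truth table as a disjunction of conjunctions."""
    terms = []
    for x, g in sorted(table.items()):
        if g == 1:
            literals = [n if xi else "NOT " + n for n, xi in zip(names, x)]
            terms.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(terms) if terms else "FALSE"


# Hypothetical linear formula: f(x, y) = -0.2 + 0.6 x + 0.6 y
table = approximate_by_boolean([0.6, 0.6], -0.2)
print(to_dnf(table, ["x", "y"]))
```

The loop visits all 2^n points, which is why the text calls for a polynomial algorithm when n is the number of pixels.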



Extension to the continuous domain 

Continuous domains can be normalized to [0,1] domains by some normalization
method, so only [0,1] domains have to be discussed. First, we have to present
a system of qualitative expressions corresponding to Boolean functions in the
[0,1] domain. The expression system is generated by direct proportion, inverse
proportion, conjunction and disjunction. The direct proportion is y = x. The
inverse proportion is y = 1 - x, which is a little different from the
conventional one (y = -x), because y = 1 - x is the natural extension of the
negation in Boolean functions. The conjunction and disjunction are also
obtained by a natural extension. The functions generated by direct proportion,
inverse proportion, conjunction and disjunction are called continuous Boolean
functions, because they satisfy the axioms of Boolean algebra. For example,
$z = x \lor \bar{y}$ means that when x increases (decreases) or y decreases
(increases), z increases (decreases). For details, refer to [8]. In the domain
[0,1], linear formulas are approximated by continuous Boolean functions. The
algorithm is the same as in the domain {0, 1}.

Note that the independent variables in nonparametric regression should be 
normalized to [0,1]. 
2.5 Related techniques

This subsection briefly explains two popular techniques, z-score and independent 
component analysis, and compares them with LRA. 



z-score 

The z-score is widely used for fMRI brain images. It is calculated pixel-wise
as follows:

$$z\text{-score} = \frac{|M_t - M_c|}{\sqrt{\sigma_t^2 + \sigma_c^2}},$$

where

$M_t$ : Average of task images
$M_c$ : Average of rest images
$\sigma_t$ : Standard deviation of task images
$\sigma_c$ : Standard deviation of rest images

Task images are the images taken while a subject performs an experimental
task. Rest images are the images taken while a subject does not perform an
experimental task.

When the z-score is 0, the average of the task images equals the average of
the rest images. When the z-score is 1 or 2, the difference between the
average of the task images and the average of the rest images is large.

The areas whose z-scores are large are related to the experimental task.
However, the z-score does not tell which slices are related to the
experimental task, and it does not tell the connections among the areas, such
as serial or parallel connection.
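A pixel-wise z-score of this kind can be sketched as follows. This is an illustration only: it assumes that the denominator combines the two standard deviations as the square root of the sum of their squares, it uses the population standard deviation, and it returns 0 where both standard deviations vanish. None of these choices is confirmed by the text, and the function name and toy images are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_scores(task_images, rest_images):
    """Pixel-wise |M_t - M_c| / sqrt(sigma_t^2 + sigma_c^2).

    Each image is a flat list of pixel values; all images have equal length.
    """
    n_pixels = len(task_images[0])
    scores = []
    for p in range(n_pixels):
        task = [img[p] for img in task_images]
        rest = [img[p] for img in rest_images]
        spread = math.sqrt(pstdev(task) ** 2 + pstdev(rest) ** 2)
        diff = abs(mean(task) - mean(rest))
        # Assumption: report 0 when there is no variation at all.
        scores.append(diff / spread if spread > 0 else 0.0)
    return scores


# Two toy "images" of two pixels each, for the task and rest conditions.
task_images = [[2.0, 5.0], [4.0, 5.0]]
rest_images = [[0.0, 5.0], [0.0, 5.0]]
z = z_scores(task_images, rest_images)
print(z)
```

Each pixel is scored independently, which is why this measure carries no information about conjunctions or disjunctions of areas.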

ICA 

Independent Component Analysis (ICA) [2] is also applied to fMRI brain images.
LRA is advantageous compared with ICA with respect to the following points:

1. LRA uses classes. That is, LRA uses task/rest information, while ICA does
not.

2. LRA conserves the spatial topologies in the images, while ICA cannot.

3. LRA works well in the case of small samples, while it is not certain
whether ICA does.

4. LRA does not fall into a local minimum, while ICA can fall into a local
minimum.

5. LRA's outputs can represent the connections among areas, while ICA's
outputs cannot.

3 Experiments 

LRA is applied to two tasks, finger tapping and calculation. The brain part of
the fMRI brain images is extracted using Statistical Parametric Mapping (SPM,
a software package for brain image analysis [6]), and LRA is applied to each
slice. Rule extraction is applied to the results of the nonparametric
regression analysis. That is, the linear functions obtained by nonparametric
regression analysis are approximated by continuous Boolean functions.
Therefore, the domains of the linear functions should be [0,1], and so the
values of the pixels should be normalized to [0,1]. See the table in
subsection 2.1.

Hiroshi Tsukimoto et al.

The extracted rules are represented by conjunctions, disjunctions and nega- 
tions of areas. The conjunction of areas means the co-occurrent activation of the 
areas. The disjunction of areas means the parallel activation of the areas. The 
negation of an area means a negative correlation. 



3.1 Finger tapping 

The first experimental task is a finger-to-thumb opposition task of the right
hand, which is called finger tapping for simplicity. The experimental
conditions are as follows:

Magnetic field     : 1.5 tesla
Pixel              : 64 x 64
Subject number     : 1
Task sample number : 30
Rest sample number : 30

Table 2 shows the errors of the nonparametric regression analysis. The error
was defined in Section 2.2. Slice 0 is the image of the bottom of the brain and
slice 33 is the image of the top of the brain. Due to the experimental
conditions, slices 0-4 have no data, which means that the images of the bottom
part were not taken in the experiment.



Table 2. Results of nonparametric regression analysis (finger tapping)

slice  error    slice  error    slice  error    slice  error    slice  error
0      *        8      0.737    16     0.020    24     0.002    32     0.008
1      *        9      0.773    17     0.189    25     0.018    33     0.006
2      *        10     0.812    18     0.022    26     0.017
3      *        11     0.013    19     0.024    27     0.111
4      *        12     0.836    20     0.017    28     0.002
5      0.815    13     0.899    21     0.017    29     0.003
6      0.752    14     0.162    22     0.016    30     0.003
7      0.764    15     0.002    23     0.018    31     0.006

Slices whose errors are small are related to the experimental task. The
errors of slice 15, slice 24, and slice 28 are small. Therefore, rule
extraction is applied to these slices. LRA can generate rules including
disjunctions. However, rules including disjunctions are too complicated to be
interpreted by human experts in brain science, because they have paid little
attention to such phenomena. Therefore, rules including disjunctions are not
generated.
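The choice of slices can be illustrated with the errors of Table 2. The text does not state a numerical cut-off, so the threshold of 0.002 below is an assumption chosen only because it reproduces the three slices named above. The function name is hypothetical, and slices 0-4 are omitted because they have no data.

```python
# Errors of nonparametric regression per slice (Table 2, finger tapping).
errors = {
    5: 0.815, 6: 0.752, 7: 0.764, 8: 0.737, 9: 0.773, 10: 0.812,
    11: 0.013, 12: 0.836, 13: 0.899, 14: 0.162, 15: 0.002, 16: 0.020,
    17: 0.189, 18: 0.022, 19: 0.024, 20: 0.017, 21: 0.017, 22: 0.016,
    23: 0.018, 24: 0.002, 25: 0.018, 26: 0.017, 27: 0.111, 28: 0.002,
    29: 0.003, 30: 0.003, 31: 0.006, 32: 0.008, 33: 0.006,
}


def select_slices(errors, threshold):
    """Return the slices whose regression error does not exceed the threshold."""
    return sorted(s for s, e in errors.items() if e <= threshold)


selected = select_slices(errors, threshold=0.002)  # assumed cut-off
print(selected)
```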
Fig. 5 shows the extracted rule of slice 15, Fig. 6 shows the extracted rule
of slice 24, and Fig. 7 shows the extracted rule of slice 28. White means high
activity, dark gray means low activity, and black means non-brain parts. White
and dark gray areas are connected by conjunction. For example, let A stand for
the white area in Fig. 6 and B stand for the dark gray area in Fig. 6. Then
Fig. 6 is interpreted as $A \land \bar{B}$, which means that area A is
activated and area B is inactivated.

The figures are viewed from the feet, and so the left side in the figures
means the right side of the brain, and the right side in the figures means the
left side of the brain. The upper side in the figures means the front of the
brain, and the lower side in the figures means the rear of the brain.



Fig. 6. finger slice 24



The white area in the upper side of Fig. 5 is related to movement planning,
although finger tapping is not so complicated. The white area in the left side
is the basal nuclei, which adjust the movement. The white area in Fig. 6 is the
supplementary motor area, which is related to voluntary movement or movement
planning. The white area in the right side of Fig. 7 means the activity of the
motor area and sensory area of the fingers. LRA has rediscovered the relations
that humans had discovered.

LRA has generated rules, but conjunctions and negations are difficult for
human experts to interpret, because researchers have paid no attention to the
co-occurrences of areas and negative correlations.





3.2 Calculation 

The second experimental task is calculation. In the experiment, a subject
mentally adds a number repeatedly. The experimental conditions are as follows:

Magnetic field     : 1.5 tesla
Pixel number       : 64 x 64
Subject number     : 8
Task sample number : 34
Rest sample number : 36

Table 3 shows the errors of the nonparametric regression analysis. The errors
are larger than those in the case of finger tapping, which means that the
regions related to calculation are distributed in three dimensions. We focus
on the slices whose errors are small, that is, the slices related to
calculation. The numbers 133, ..., 336 in the table are subject numbers.



Table 3. Results of nonparametric regression analysis (calculation)

slice  133    135    312    317    321          331    332    336
0      0.924  0.882  0.444  0.547  [illegible]  0.870  0.455  0.306
1      0.418  0.030  0.546  0.587  0.298        0.814  0.028  0.946
2      0.375  0.538  0.337  0.435  0.278        0.723  0.381  0.798
3      0.016  0.510  0.585  0.430  0.282        0.743  0.402  0.798
4      0.456  0.437  0.519  0.446  0.157        0.636  0.419  0.058
5      0.120  0.469  0.473  0.376  0.265        0.698  0.385  0.366
6      0.965  0.434  0.602  0.138  0.380        0.475  0.420  0.541
7      1.001  0.230  0.430  0.309  0.119        0.175  0.482  0.547
8      1.001  0.388  0.434  0.222  0.478        0.246  0.387  0.704
9      0.968  0.473  0.362  0.281  0.390        0.409  0.193  0.913
10     1.001  0.008  0.447  0.357  0.341        0.358  0.227  0.908
11     1.001  0.066  0.383  0.380  0.167        0.275  0.115  0.914
12     1.001  0.736  0.302  0.312  0.397        0.021  0.181  0.909
13     0.828  0.793  0.525  0.222  0.455        0.845  0.204  0.733
14     0.550  0.822  0.349  0.523  0.023        0.229  0.130  0.474
15     0.528  0.805  0.298  0.569  0.107        0.439  0.338  0.374
16     0.571  0.778  0.494  0.509  0.008        0.354  0.377  0.493
17     0.009  0.007  0.159  0.615  0.238        0.159  0.561  0.774
18     0.089  0.060  0.663  0.010  0.011        0.033  0.519  0.711
19     0.642  0.238  0.573  0.405  0.185        0.426  0.470  0.689
20     0.887  0.514  0.383  0.376  0.149        0.177  0.214  0.430
21     0.282  0.532  0.256  0.028  0.018        0.219  0.303  0.548
22     0.281  0.415  0.613  0.167  0.045        0.213  0.352  0.528
23     0.521  0.422  0.229  0.227  0.048        0.306  0.050  0.450
24     0.814  0.270  0.401  0.439  0.013        0.212  0.350  0.570
25     0.336  0.394  0.411  0.195  0.469        0.148  0.414  0.689
26     0.603  0.008  0.390  0.180  0.477        0.107  0.358  0.541
27     0.535  0.062  0.324  0.191  0.308        0.279  0.455  0.413
28     0.719  0.010  0.371  0.271  0.167        0.436  0.237  0.649
29     0.942  0.310  0.400  0.257  0.169        0.353  0.023  0.775
30     0.898  0.360  0.547  0.283  0.209        0.467  0.464  0.157
31     0.746  0.026  0.023  0.445  0.187        0.197  0.084  0.195

LRA can generate rules including disjunctions. However, the rules including 
disjunctions are too complicated to be interpreted. Therefore, the rules including 
disjunctions are not generated. 
Table 4 summarizes the results of LRA. Numbers in parentheses are slice
numbers. Activation in the left angular gyrus and supramarginal gyrus was
observed in 4 and 3 cases, respectively, and that in the right angular gyrus
and supramarginal gyrus was observed in 3 cases and 1 case, respectively.
Clinical observations show that damage to the left angular and supramarginal
gyri causes acalculia, which is defined as an impairment of the ability to
calculate. Despite the strong association of acalculia with left posterior
parietal lesions, there are certain characteristics of acalculia that have led
to the suggestion of a right-hemispheric contribution. Clinical observations
also suggest that acalculia is caused by lesions not only in the left parietal
region and frontal cortex but also in the right parietal region. Fig. 8 shows
slice 18 of subject 331 and Fig. 9 shows slice 26 of subject 331.

Table 4. Results of LRA

Subject No.                             Left                               Right
133          cingulate gyrus(17)        Cerebellum(3)                      Cerebellum(0,5)
                                        superior frontal gyrus(17)         middle frontal gyrus(25)
                                        inferior frontal gyrus(17)
                                        superior temporal plane(17,18)
                                        middle frontal gyrus(18,21)
                                        angular gyrus(21,22)

135                                     inferior frontal gyrus(17,18)      superior frontal gyrus(26)
                                        superior temporal plane(17,18)     superior parietal gyrus(26,28)
                                        precuneus(26,28)

312          cingulate gyrus(21)        inferior frontal gyrus(15)         angular gyrus(17)
                                        angular gyrus(21)
                                        supramarginal gyrus(18,21)
                                        middle frontal gyrus(23)

317          cingulate gyrus(27)        inferior frontal gyrus(18,21)      angular gyrus(26)
                                        cuneus(21,22,26,27)

321          cingulate gyrus(16,22,24)  inferior frontal gyrus(14)         cuneus(16,18)
                                        postcentral gyrus(16,18)           parieto-occipital sulcus(21)
                                        cuneus(21)                         supramarginal gyrus(22,24)
                                        parieto-occipital sulcus(22)
                                        supramarginal gyrus(24)

331          cingulate gyrus(25,26)     inferior frontal gyrus(12,17,18)   angular gyrus(17,18,26)
                                        angular gyrus(17,18,26)            middle temporal gyrus(7,12,17)
                                        supramarginal gyrus(17,18)

332                                     inferior temporal gyrus(1)         inferior temporal gyrus(1)
                                        Cerebellum(1)                      Cerebellum(1)
                                        postcentral gyrus(33)              middle frontal gyrus(29)
                                        middle temporal gyrus(12)          superior parietal gyrus(29,43)
                                        pre-, post-central gyrus(14)
                                        angular gyrus(23)
                                        middle frontal gyrus(29)

336                                     Cerebellum(0,5)                    Cerebellum(0,5)
                                        middle temporal gyrus(4,5)         superior parietal gyrus(30)
                                        middle frontal gyrus(30,31)        occipital gyrus(11)
                                        precentral gyrus(31)
                                        superior frontal gyrus(32)

Significant activation was observed in the left inferior frontal gyrus in 6
out of 8 cases. On the other hand, none was observed in the right inferior
frontal gyrus. The result suggests that the left inferior frontal gyrus,
including Broca's area, is activated in most subjects in connection with the
implicit verbal processes required for the present calculation task.
Furthermore, significant activation in the frontal region, including the
middle and superior frontal regions, was found in 8 cases (100%) in the left
hemisphere and in 3 cases in the right hemisphere. The left dorsolateral
prefrontal cortex may play an important role as a working memory for
calculation. Fig. 10 shows slice 17 of subject 135 and Fig. 11 shows slice 18
of subject 317.

In addition to these activated regions, activation in the cingulate gyrus,
cerebellum, central regions and occipital regions was found. The activated
regions depended on the individual, suggesting different individual
strategies. Occipital regions are related to spatial processing, and the
cingulate gyrus is related to intensive attention. Central regions and the
cerebellum are related to motor imagery. 5 out of 8 subjects use the cingulate
gyrus, which means that they are intensively attentive. Fig. 12 shows slice 17
of subject 133. 3 out of 8 subjects use the cerebellum, which is thought not
to be related to calculation. Fig. 13 shows slice 1 of subject 332. The above
two results are very interesting discoveries that have never been
experimentally confirmed so far. Whether these regions are specifically
related to mental calculation is to be investigated in further research with
many subjects.

Fig. 11. cal. slice 18 317   Fig. 12. cal. slice 17 133   Fig. 13. cal. slice 1 332

LRA has generated rules that combine regions by conjunction and negation.
As for conjunctions and negations, the results showed that inactivated regions
occurred simultaneously with the activated regions. In the present experiment,
inactivation in the brain region contralateral to the activated region was
observed, suggesting inhibitory processes through the corpus callosum. LRA has
the possibility of providing new evidence in brain hemodynamics.



3.3 Future work 

In the experiments, LRA is applied to slices, that is, 2-dimensional fMRI
brain images. However, complicated tasks such as calculation are related to at
least a few areas, and so the application of LRA to a set of a few slices is
necessary for fruitful knowledge discovery from fMRI brain images. Moreover,
2-dimensional nonparametric regression analysis can be regarded as knowledge
discovery after attribute selection. Therefore, it is desirable that LRA be
applied to 3-dimensional fMRI brain images. However, the nonparametric
regression analysis of 3-dimensional fMRI brain images needs a huge
computational time. Reducing the computational time is therefore included in
future work. The interpretation of conjunctions and negations is also included
in future work.

4 Conclusions 

This paper has presented an LRA algorithm for the discovery of rules from fMRI
brain images. The LRA algorithm consists of nonparametric regression and rule
extraction from the linear formula obtained by the nonparametric regression.
The LRA algorithm has discovered new relations regarding brain functions.

References 

1. Eubank, R.L.: Spline Smoothing and Nonparametric Regression, Marcel
Dekker, New York, 1988.

2. Lee, T.W.: Independent Component Analysis, Kluwer Academic Publishers, 1998.

3. Miwa, T.: private communication, 1998.

4. Posner, M.I. and Raichle, M.E.: Images of Mind, W. H. Freeman & Co, 1997.

5. Quinlan, J.R.: Induction of decision trees. Machine Learning 1, pp. 81-106, 1986.

6. http://www.fil.ion.ucl.ac.uk/spm/

7. Talairach, J. and Tournoux: Co-planar Stereotaxic Atlas of the Human Brain.
New York: Thieme Medical, 1988.

8. Tsukimoto, H. (1994): On continuously valued logical functions satisfying all axioms
of classical logic, Systems and Computers in Japan, Vol. 25, 12, 33-41, SCRIPTA
TECHNICA, INC.

9. Tsukimoto, H. and Morita, C.: Efficient algorithms for inductive learning - An
application of multi-linear functions to inductive learning. Machine Intelligence 14,
pp. 427-449, Oxford University Press, 1995.

10. Tsukimoto, H., Morita, C., Shimogori, N.: An Inductive Learning Algorithm Based
on Regression Analysis. Systems and Computers in Japan, Vol. 28, No. 3, pp. 62-70,
1997.

11. Tsukimoto, H. and Morita, C.: The discovery of rules from brain images, The First
International Conference on Discovery Science, pp. 198-209, 1998.

12. Tsukimoto, H.: Extracting Rules from Trained Neural Networks, IEEE Transactions
on Neural Networks, pp. 377-389, 2000.




Human Discovery Processes Based on 
Searching Experiments in Virtual Psychological 
Research Environment 



Kazuhisa Miwa 

Graduate School of Human Informatics, Nagoya University, Nagoya 464-8601, Japan 
miwa@cog.human.nagoya-u.ac.jp 



Abstract. When designing experiments in the social and human sciences, we 
must often consider various complex factors that seem to determine subjects’ 
performance. It is sometimes difficult to plan experiments completely, that is, 
to establish the hypotheses guiding the experiments before executing them. Even 
in such situations, experts in the field organize their experimental processes 
systematically. We propose the Searching Experimental Scheme (SES), which 
enables them to do so. To confirm the validity of SES, we construct a virtual 
psychological experimental environment using a cognitive simulator, in which 
subjects try to generate hypotheses and conduct experiments as scientists do. We 
analyze the subjects’ behavior based on SES and discuss the relation 
between the characteristics of their behavior and their performance in 
discovering targets. 



1 Introduction 

We can divide the ways of acquiring empirical data in the process of discovery 
into two basic categories: experimentation and observation. In experimentation, 
data are gathered systematically on the basis of previously formed hypotheses. 
In experimental psychology, the most orthodox example is the Factorial Design 
(FD) experiment, in which the factors a researcher focuses on are manipulated 
systematically under clearly established hypotheses, and the relation between 
the manipulated factors and the observed data is identified. In observation, on 
the other hand, data are not collected systematically in this way. Usually, 
hypotheses for manipulating experimental factors cannot be formed, so data 
gathering becomes a Trial and Error (TE) search in which experimental data 
are observed randomly in order to form an initial hypothesis. 

Experimental design in a real research environment usually reflects the 
characteristics of both of these typical categories. For example, the search of 
some levels of certain controlled factors may be omitted even though the global 
structure of the experimental design is FD; or the experimental design may be 
locally FD while the global structure (i.e. the relation among the local units) is 
TE. We call these intermediate ways of experimentation “searching experiments”, 
which is a key concept of this study. The process of searching experiments 
appears when (1) the experimental space that subjects try to search is huge, so 
the subjects cannot search all combinations of the experimental conditions at 
once, (2) the goal itself is ambiguous, that is, the research objective itself is 
being searched for, and (3) the relation between independent and dependent 
factors cannot be clearly predicted because of a lack of knowledge of the 
research domain or the existence of complex interactions among the 
experimental factors. 

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 225-238, 2000. 

Searching experiments are especially important in the social and human 
sciences because most research situations there are relatively complex and 
satisfy the conditions above [1] [7]. Researchers use searching experiments 
effectively to organize their experimental processes systematically in such 
complex research situations. In this study, we propose the “Searching Experimental 
Scheme”, which enables researchers to perform systematic search even when 
well-organized experimental planning such as FD experiments cannot be adopted. 
Then we analyze subjects’ behavior based on the scheme. We also discuss the 
relation between the characteristics of searching behavior and the subjects’ 
performance in discovering targets. To do so, using a discovery task that satisfies 
the conditions in which searching experiments appear, we let subjects experience 
a series of experimental processes, such as setting up a research objective, 
forming a hypothesis, designing experiments, performing experiments, interpreting 
experimental results, and arranging additional experiments. 

To discuss the issues above, it is difficult to let subjects perform real 
psychological experiments because of the cost of executing them. So in this study 
we let them perform virtual psychological experiments using a cognitive simulator, 
constructed as a computer program, instead of performing real experiments. 
Subjects behave as experimental psychologists in the research environment 
provided by the simulator [8]. 

2 Virtual Psychological Research Environment 

2.1 Wason’s 2-4-6 Task 

The simulator used in this study is a cognitive model that simulates collaborative 
discovery processes in which two problem solvers interactively solve a traditional 
discovery task, Wason’s 2-4-6 task, while referring to each other’s experimental 
results [9]. Subjects participate in this experiment as experimental psychologists 
who study collaborative discovery processes using Wason’s task. 

The standard procedure of the 2-4-6 task is as follows. Subjects are required 
to find a rule describing the relationship among three numerals. In the most 
popular situation, a set of three numerals, “2, 4, 6”, is presented to subjects at 
the initial stage. The subjects form hypotheses about the regularity of the 
numerals based on the presented set. Subjects conduct experiments by producing a 
new set of three numerals and presenting it to an experimenter. This set is called 
an instance. The experimenter gives Yes feedback to subjects if the set produced by 
subjects is an instance of the target rule, or No feedback if it is not an instance 
of the target. Subjects carry out experiments continuously, receive feedback from 
each experiment, and search for the target. 

Two types of experimentation, Ptest and Ntest, are considered. Ptest is 
experimentation using a positive instance for a hypothesis, whereas Ntest is 
experimentation using a negative instance. For example, when a subject has the 
hypothesis that the three numerals are evens, an experiment using the instance “2, 
8, 18” corresponds to Ptest, and an experiment with “1, 2, 3” corresponds to 
Ntest. Note that whether a test is positive or negative is defined relative to the 
subject’s hypothesis, whereas Yes or No feedback is defined relative to the target. 
We should also notice the pattern of hypothesis reconstruction based on the 
combination of a hypothesis testing strategy and an experimental result (Yes or No 
feedback from an experimenter). When Ptest is conducted and No feedback is given, 
the hypothesis is disconfirmed. Another case of disconfirmation is the combination 
of Ntest and Yes feedback. On the other hand, the combinations of Ptest - Yes 
feedback and Ntest - No feedback confirm the hypothesis. 
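The four combinations of test type and feedback can be written down compactly. The following is a minimal sketch, not the simulator's code: the function names and the representation of a rule as a Python predicate are assumptions made for illustration. The hypothesis “three evens” comes from the example in the text, and “ascending numbers” is used as an illustrative target.

```python
from typing import Callable, Tuple

Instance = Tuple[int, int, int]
Rule = Callable[[Instance], bool]


def classify_test(hypothesis: Rule, instance: Instance) -> str:
    """Ptest if the instance is a positive instance of the subject's hypothesis."""
    return "Ptest" if hypothesis(instance) else "Ntest"


def feedback(target: Rule, instance: Instance) -> str:
    """Yes/No feedback depends only on the target rule, not on the hypothesis."""
    return "Yes" if target(instance) else "No"


def outcome(hypothesis: Rule, target: Rule, instance: Instance) -> str:
    """Ptest-Yes and Ntest-No confirm; Ptest-No and Ntest-Yes disconfirm."""
    agrees = hypothesis(instance) == target(instance)
    return "confirmed" if agrees else "disconfirmed"


# Illustrative rules (assumed encodings, not the simulator's representation).
def three_evens(inst: Instance) -> bool:
    return all(n % 2 == 0 for n in inst)


def ascending(inst: Instance) -> bool:
    return inst[0] < inst[1] < inst[2]


if __name__ == "__main__":
    for inst in [(2, 8, 18), (1, 2, 3), (6, 4, 2), (3, 2, 1)]:
        print(inst,
              classify_test(three_evens, inst),
              feedback(ascending, inst),
              outcome(three_evens, ascending, inst))
```

With the hypothesis “three evens” and the target “ascending numbers”, the instance “1, 2, 3” is an Ntest that receives Yes feedback, so it disconfirms the hypothesis even though the subject expected a No.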

2.2 Interactive Production System 

We have developed an interactive production system architecture for constructing 
the cognitive simulator and providing the virtual psychological research 
environment. The architecture consists of five parts: production sets of System A; 
production sets of System B; a working memory of System A; a working memory 
of System B; and a commonly shared blackboard (see Figure 1). The two 
systems interact through the common blackboard. That is, each system writes 
elements of its working memory on the blackboard and the other system can 
read them from the blackboard. The model solving Wason’s 2-4-6 task has 
been constructed using this architecture. 
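The blackboard interaction described above can be sketched as follows. This is an illustrative data-structure sketch only: the class and method names (`Blackboard`, `System`, `publish`, `observe`) are hypothetical, and the production sets that drive each system are omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Blackboard:
    """Shared store through which the two systems exchange memory elements."""
    entries: List[Dict[str, object]] = field(default_factory=list)

    def write(self, author: str, element: object) -> None:
        self.entries.append({"author": author, "element": element})

    def read_from(self, author: str) -> List[object]:
        return [e["element"] for e in self.entries if e["author"] == author]


@dataclass
class System:
    """One problem solver; only its working memory is modeled here."""
    name: str
    working_memory: List[object] = field(default_factory=list)

    def publish(self, board: Blackboard) -> None:
        """Write every element of this system's working memory to the blackboard."""
        for element in self.working_memory:
            board.write(self.name, element)

    def observe(self, board: Blackboard, other: str) -> None:
        """Copy the other system's published elements into this working memory."""
        self.working_memory.extend(board.read_from(other))


if __name__ == "__main__":
    board = Blackboard()
    system_a = System("A", ["instance (2, 4, 6) -> Yes"])
    system_b = System("B")
    system_a.publish(board)
    system_b.observe(board, "A")
    print(system_b.working_memory)
```

Routing all exchange through one shared object means neither system needs a reference to the other, which matches the description that the systems interact only through the common blackboard.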




Fig. 1. Basic structure of the interactive production system architecture 



The model has knowledge of the regularities of three numerals, which 
is used for hypothesis generation in the process of solving the 2-4-6 task. The 
knowledge is organized as dimension-value lists. For example, “continuous 
evens”, “three evens”, and “the first numeral is even” are example values of the 
dimension “Even-Odd”. The dimensions the systems use are: Even-Odd, Order, 
Interval, Range of digits, Certain digit, Mathematical relationship, Multiples, 
Divisors, Sum, Product, Different. 

The way of searching the hypothesis space is controlled by the system’s 
parameter that decides the hypothesis formation strategy (see Table 2). 

Basically, the model searches the hypothesis space randomly in order to 
generate a hypothesis (when the parameter is set to [random]). Three 
hypotheses, “three continuous evens”, “interval is 2”, and “three evens”, are 
special cases. Human subjects tend to generate these hypotheses first when the 
initial instance, “2, 4, 6”, is presented. So our model also generates these 
hypotheses before other possible hypotheses when the parameter is set to 
[human]. 

The detailed specifications of this model are given in Miwa & Okada, 1996 
[5]. 
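The difference between the [random] and [human] settings can be illustrated with a small sketch. It is a simplification under stated assumptions: the function `hypothesis_order` is hypothetical, the relative order of the three preferred hypotheses and the random ordering of the remaining ones are assumptions, and the [specific] and [general] strategies are not covered.

```python
import random
from typing import List, Optional

# The three hypotheses the text says humans tend to generate first.
# Their relative order here is an assumption.
HUMAN_FIRST = ["three continuous evens", "interval is 2", "three evens"]


def hypothesis_order(candidates: List[str], strategy: str,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Return the order in which candidate hypotheses would be tried."""
    rng = rng or random.Random()
    if strategy == "random":
        order = list(candidates)
        rng.shuffle(order)
        return order
    if strategy == "human":
        first = [h for h in HUMAN_FIRST if h in candidates]
        rest = [h for h in candidates if h not in HUMAN_FIRST]
        rng.shuffle(rest)  # assumed: the remaining hypotheses are tried randomly
        return first + rest
    raise ValueError(f"strategy not covered by this sketch: {strategy}")


if __name__ == "__main__":
    candidates = ["three evens", "ascending numbers", "interval is 2",
                  "three different numbers", "three continuous evens"]
    print(hypothesis_order(candidates, "human", random.Random(0)))
```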

2.3 An Example Behavior of the Simulator 

Table 1 shows an example result of the computer simulations. The target was 
“Divisor of three numerals is 12”. Two systems interactively found the target. 
One system, System A, always used Ptest in its experiments, and the other, 
System B, used Ntest. The table principally consists of three columns. The left-most 
and right-most columns indicate hypotheses formed by System A and System 
B respectively. The middle column indicates experiments, that is, generated 
instances, Yes or No feedback, and the distinction of Ptest or Ntest conducted by 
each system. Experiments were conducted alternately by the two systems, and 
the results of the experiments were sent to both systems. The left-most 
number in each column indicates the order of processing, from #1 through #41. 
The right-most number in the left-most and right-most columns indicates the 
number of times each hypothesis has been continuously confirmed. System A 
disconfirmed its hypotheses at #4, #10, #16, prompted by self-conducted 
experiments at #3, #9, #15. System B disconfirmed its hypotheses at #17, #29, 
prompted by other-conducted experiments at #15, #27. 
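As a consistency check, the feedback listed in Table 1 can be replayed against a candidate reading of the target. The sketch below assumes that “Divisor of three numerals is 12” means that every numeral divides 12, with the sign ignored; this reading is an inference from the table and is not stated in the text. The instances and feedback are copied from Table 1.

```python
from typing import List, Tuple

Instance = Tuple[int, int, int]


def divisors_of_12(inst: Instance) -> bool:
    """Assumed reading of the target: every numeral divides 12 (sign ignored)."""
    return all(n != 0 and 12 % abs(n) == 0 for n in inst)


# (processing step, instance, feedback) as listed in Table 1.
TABLE_1_EXPERIMENTS: List[Tuple[int, Instance, str]] = [
    (1, (2, 4, 6), "Yes"),
    (3, (4, 6, 8), "No"),
    (6, (6, 6, -17), "No"),
    (9, (24, -1, -2), "No"),
    (12, (3, -8, -20), "No"),
    (15, (-10, 2, -8), "No"),
    (18, (-5, -14, -9), "No"),
    (21, (2, 4, 6), "Yes"),
    (24, (-17, 3, 12), "No"),
    (27, (2, 12, -12), "Yes"),
    (30, (8, 12, -2), "No"),
    (33, (2, 6, -2), "Yes"),
    (36, (-2, -7, -8), "No"),
    (39, (4, 3, -12), "Yes"),
]


def mismatching_steps() -> List[int]:
    """Steps whose reported feedback disagrees with the assumed target."""
    return [step for step, inst, reported in TABLE_1_EXPERIMENTS
            if ("Yes" if divisors_of_12(inst) else "No") != reported]


if __name__ == "__main__":
    print(mismatching_steps())
```

Under this reading no step disagrees with the reported feedback, which supports the interpretation but does not prove that it is the simulator's definition.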

What we should note is that the simulator actually simulates human 
discovery processes. The validity of the simulator as a cognitive model has 
already been verified in our other papers. So using this simulator provides a more 
realistic research environment in which we can observe the searching processes of 
subjects who behave as experimental psychologists. 

2.4 Parameters Deciding the Model’s Behavior 

The parameters that decide the simulator’s behavior consist of the 6 factors 
indicated in Table 2. The five parameters other than the first, Target, are set 
separately for each of the two interacting systems. 

Table 1. An example behavior of the simulator 

| Hypotheses by System A | Experiments | Hypotheses by System B |
|---|---|---|
| | 1  2, 4, 6  Yes | |
| 2  Continuous even numbers.  0 | 3  4, 6, 8  No  Ptest by SysA | |
| 4  The product is 48.  0 | 6  6, 6, -17  No  Ntest by SysB | 5  The sum is a multiple of 4.  0 |
| 8  The product is 48.  1 | 9  24, -1, -2  No  Ptest by SysA | 7  The sum is a multiple of 4.  1 |
| 10  First + Second = Third.  0 | 12  3, -8, -20  No  Ntest by SysB | 11  The sum is a multiple of 4.  2 |
| 14  First + Second = Third.  1 | 15  -10, 2, -8  No  Ptest by SysA | 13  The sum is a multiple of 4.  3 |
| 16  Divisor is 12.  0 | 18  -5, -14, -9  No  Ntest by SysB | 17  The second is 4.  0 |
| 20  Divisor is 12.  1 | 21  2, 4, 6  Yes  Ptest by SysA | 19  The second is 4.  1 |
| 22  Divisor is 12.  2 | 24  -17, 3, 12  No  Ntest by SysB | 23  The second is 4.  2 |
| 26  Divisor is 12.  3 | 27  2, 12, -12  Yes  Ptest by SysA | 25  The second is 4.  3 |
| 28  Divisor is 12.  4 | 30  8, 12, -2  No  Ntest by SysB | 29  Divisor is 12.  0 |
| 32  Divisor is 12.  5 | 33  2, 6, -2  Yes  Ptest by SysA | 31  Divisor is 12.  1 |
| 34  Divisor is 12.  6 | 36  -2, -7, -8  No  Ntest by SysB | 35  Divisor is 12.  2 |
| 38  Divisor is 12.  7 | 39  4, 3, -12  Yes  Ptest by SysA | 37  Divisor is 12.  3 |
| 40  Divisor is 12.  8 | | 41  Divisor is 12.  4 |

Let us now compare the situation in which searching experiments appear (see 
Section 1) with the virtual experimental environment provided by the simulator. 
First, it is impossible to search the whole experimental space because it consists 
of about two hundred million conditions 
(= $35 \times 5^2 \times 4^2 \times 5^2 \times 5^2 \times 5^2$). Second, the focused 
factors, which are decided based on research objectives, are selected independently 
by the subjects themselves; in fact, the solutions reached by the subjects all differ. 
Third, there are complex interactions, especially among three factors: hypothesis 
testing strategies, hypothesis formation strategies, and targets. These points 
indicate that the research environment used in this study embodies the situation 
in which searching experiments appear. 
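The size of the experimental space quoted in Section 2.4 follows directly from the level counts in Table 2: the target is chosen once, while the other five factors are set separately for each of the two systems. A short check, with the level counts read from Table 2 (the constant and function names are illustrative):

```python
from math import prod

# Number of levels per factor, as listed in Table 2.
TARGETS = 35
LEVELS_PER_SYSTEM = {"HT": 5, "HF": 4, "AI": 5, "RH": 5, "TE": 5}
NUM_SYSTEMS = 2  # System A and System B are configured independently


def experimental_space_size() -> int:
    """Total number of distinct conditions in the simulator's experimental space."""
    per_system = prod(LEVELS_PER_SYSTEM.values())
    return TARGETS * per_system ** NUM_SYSTEMS


if __name__ == "__main__":
    print(experimental_space_size())  # 218750000
```

The result, 218,750,000 conditions, is consistent with the figure of about two hundred million given in the text.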



3 Experiments 

Six graduate students participated in the experiment. They attended a graduate 
school class given by the author. The topic of the class was experimental psycho- 
logical studies on human hypothesis testing. So the experimental situation was 
that the subjects who had obtained basic psychological knowledge on human 
hypothesis testing were required to study collaborative discovery processes in 
the experimental environment, applying the basic knowledge to the collabora- 
tive situation. Each subject individually participated in the experiment. After 
instructional guidance for 20 minutes, the main experiment, in which the sub- 
jects manipulated the simulator and studied collaborative discovery processes, 
was carried out for 2 hours; then an interview for 15 minutes was conducted. 

In the main experiment, the subjects performed experiments manipulating 
the simulator independently. An experimental planning sheet was used; the sheet 
consisted of 5 items: (1) a research objective (what they investigate), (2) 
hypotheses, (3) an experimental design, (4) experimental results, and (5) 
interpretation of the experimental results. The subjects filled out the first three 
items, and then they actually conducted experiments by manipulating the 
simulator. After the experiments they filled out the last two items of the sheet. 
They repeated this procedure during the main experiment. In the interview after 
the experimental session, the subjects’ conclusions (i.e. what they found) from 
the whole set of experiments were identified. 

Table 2. Six factors of the simulator 

| Factors | Levels | Description |
|---|---|---|
| Target [T] | [1]-[35] | Thirty-five kinds of targets that were used in the experiment. For example, Target #1 is "ascending numbers"; Target #35 is "three different numbers". |
| Hypothesis testing strategies [HT] | [0], [25], [50], [75], [100] | The probability of conducting positive tests in generating instances. [100] and [0] mean that systems always conduct positive tests and negative tests, respectively. |
| Hypothesis formation strategies [HF] | [human], [random], [specific], [general] | [human] means that systems generate hypotheses as humans do. [random]: generating hypotheses randomly. [specific]: generating specific hypotheses prior to general ones. [general]: generating general hypotheses prior to specific ones. |
| # of activated instances [AI] | [all], [6], [5], [4], [3] | The number of instances that can be activated at once in the working memory when generating hypotheses. |
| # of maintained hypotheses [RH] | [all], [5], [4], [3], [2] | The number of rejected hypotheses that can be maintained in the working memory. |
| Condition for terminating the search [TE] | [all], [5], [4], [3], [2] | The number of continuous confirmations at which systems terminate the search. [2] means that when a hypothesis is continuously confirmed two times, systems recognize the hypothesis as the solution and terminate the search. |



After the combination of levels of the 6 factors is decided, twenty simulations 
are automatically executed in that condition. Then the experimental system 
presents (1) the ratio of each of the two systems correctly finding the target, (2) 
the ratio of at least one of the two systems reaching a correct solution, and (3) 
the average number of generated instances for reaching a correct solution. The 
system also presents the model’s solution process for each simulation in addition 
to these final results; however, following the experimenter’s instruction, 
the subjects focus only on the final performance of the systems and try to 
find factors that explain the performance. The experimental system automatically 
records the subjects’ experimental behavior. Additionally, the processes were 
recorded by a video camera, and the subjects’ verbal protocols were gathered. 
These protocols were used as secondary data for identifying the subjects’ behavior 
when their description on the experimental planning sheet was ambiguous. 
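The three summary measures presented after each batch of simulations can be computed as in the sketch below. The record type `Run` and the function `summarize` are hypothetical names, and one detail is an assumption: the average number of generated instances is taken over the runs in which at least one system reached a correct solution.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Run:
    """Outcome of one simulation (field names are illustrative)."""
    a_correct: bool   # System A found the target
    b_correct: bool   # System B found the target
    instances: int    # number of instances generated in this run


def summarize(runs: List[Run]) -> Tuple[float, float, float, Optional[float]]:
    """Return (ratio A correct, ratio B correct, ratio either correct,
    average instances over solved runs, or None if no run was solved)."""
    if not runs:
        raise ValueError("at least one simulation run is required")
    n = len(runs)
    ratio_a = sum(r.a_correct for r in runs) / n
    ratio_b = sum(r.b_correct for r in runs) / n
    solved = [r for r in runs if r.a_correct or r.b_correct]
    ratio_any = len(solved) / n
    average = sum(r.instances for r in solved) / len(solved) if solved else None
    return ratio_a, ratio_b, ratio_any, average


if __name__ == "__main__":
    demo = [Run(True, True, 10), Run(True, False, 20),
            Run(False, False, 30), Run(False, True, 30)]
    print(summarize(demo))
```

In the experiment each condition would supply twenty such records; the four-run list above is made-up demonstration data.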

4 Searching Experimental Scheme 

4.1 Expanded Search Within/Out of Focused Factors 

In this study, we describe subjects’ experimental processes based on the 
“Searching Experimental Scheme” (SES). Figure 2 shows the experimental space 
consisting of the 6 factors of the simulator, that is, the combinations of every 
level of each factor [2]. For Factor 2 through Factor 6, a level is decided for each 
of the two systems. The bold lines of Figure 2 show an example combination: Factor 
1, the target used is “ascending numbers” (Target #1); Factor 2 and Factor 3, the 
combination of hypothesis testing and formation strategies is positive testing 
and specific formation strategies in one system vs. negative testing and general 
formation strategies in the other system; Factor 4, all instances in the working 
memory can be activated; Factor 5, every hypothesis can be maintained in the 
memory; Factor 6, search is terminated when a hypothesis is supported by three 
continuous confirmations. 



Fig. 2. Experimental space of the simulator. The six columns of the figure are 
Factor 1 (target), Factor 2 (hypothesis testing strategy), Factor 3 (hypothesis 
formation strategy), Factor 4 (# of activated instances), Factor 5 (# of remembered 
hypotheses), and Factor 6 (termination of experiments). 



Subjects’ behavior will be described on SES as shown in Figure 3. Figure 3 
consists of three basic units, Unit A11, Unit A12, and Unit A21. Each unit 
corresponds to a set of subjects’ searching behavior. We regard a series of 
continuous experiments guided by a single experimental design on one experimental 
planning sheet as a set of searching behavior. 

In Unit A11, a subject manipulates Factor n and Factor m, and searches some 
levels of the factors indicated by the bold lines. We call these manipulated factors 
“focused factors”. Focused factors are indicated by dark gray boxes. Next, in Unit 
A12, another factor, Factor p, indicated by a light gray box, is manipulated while 
fixing the levels of the focused factors already searched in Unit A11. We call this 
searching behavior “expanded search out of focused factors”. 

Moreover, subjects do not necessarily search all levels of the focused factors 
within a single unit, so they often conduct additional search of the focused 
factors. For example, in Unit A21, a subject searches levels of the focused 
factors other than those already searched in Unit A11. We call this 
searching behavior “expanded search within focused factors”. Although subjects 
cannot search all levels of the focused factors at once because of their cognitive 
resource constraints, they try to analyze the effects of the focused factors on the 
total performance by conducting the expanded search within focused factors. 
Moreover, by conducting the expanded search out of focused factors, they try to 
know how the results obtained on the focused factors are affected by fluctuation 
of other factors. The expanded search within/out of focused factors reflects the 
characteristics of well-organized searching experiments. 

Fig. 3. Searching experimental scheme (Series A1 and Series A2 over Factor n, 
Factor m, and Factor p) 



4.2 Levels of Searching Behavior 



An important point in using SES as defined above is that we can identify several 
levels of regularity in subjects’ searching behavior. First, on the most basic level, 
a chunk (a unified set of subjects’ searching behavior) is represented as each of 
Unit A11, Unit A12, and Unit A21. We call each chunk a “Unit”. Next, on the 
second level, a chunk is constructed by expanded search out of focused factors. 
We call this chunk a “Series”. On this level, the subjects’ behavior in Figure 3 
is unified into two chunks: one chunk is Series A1, which consists of Unit A11 and 
Unit A12, and the other is Series A2. Finally, on the third level, the whole of the 
subjects’ behavior in Figure 3, organized by expanded search within focused 
factors, is regarded as one chunk. We call this chunk on the highest level a “Block”. 

Now we should define the termination of a Series and of a Block. A Series 
continues while subjects manipulate factors other than the focused factors while 
fixing the already searched levels (or a part of the levels) of the focused factors. 
A Series terminates when subjects conduct expanded search out of focused factors 
while also shifting the search of the focused factors to new levels that have not 
been examined. A Block continues while subjects manipulate the focused factors 
while fixing the already searched levels (or a part of the levels) of factors other 
than the focused factors. A Block terminates when both the focused factors and 
other factors are manipulated at once. 

Figure 4 shows an example of the searching behavior of Subject B described based 
on SES. 

Fig. 4. An example behavior of Subject B described based on SES 



4.3 Three Stages of Chunking 

Figure 5 shows the total number of experiments of each subject and the numbers 
of chunks on the three levels (Unit, Series, and Block) defined in the previous 
section. Constructing chunks on the higher levels means higher organization 
of subjects’ searching behavior, so Figure 5 indicates the phased 
organization process of the subjects’ searching behavior. 

Fig. 5. Three levels of chunking by 6 subjects 

Now, to model the patterns of the phased organization process, let us consider 
a two-factor (3 x 3) experimental design. Figure 6 (a) shows the case in which 
the experiments are performed based on FD, in which all levels of the two focused 
factors are searched at once. In this case a Unit is equal to a Block. Searching 
behavior is organized only through the process of constructing a Unit from 
individual experiments. We call this organization process the “first stage of 
chunking”. On the other hand, when experiments are performed based on TE search, 
every experiment is independent of the preceding and following experiments, so 
no chunking happens. In this case each single experiment constructs a Block 
(see Figure 6 (e)). The characteristics of subjects’ searching experiments 
appear in expanded search within/out of focused factors; they can be modeled, 
from the viewpoint of the three levels of chunking, based on the patterns 
depicted in Figure 6 (b) through (d). In Figure 6 (b), after manipulating a single 
focused factor, subjects conduct expanded search out of the focused factor. We 
call this organization of searching behavior the “second stage of chunking”. In 
Figure 6 (c), although a subject manipulates two factors at once, not all levels of 
the focused factors are searched in the first unit, so expanded search within 
the focused factors appears. We call this organization process the “third 
stage of chunking”. In Figure 6 (d), both types of expanded search appear. 



4.4 Compression Ratio of Chunking 

Comparing Figure 5 with Figure 6, we see that the behavior of Subject A 
represents the characteristics of FD experimental processes, whereas the 
behavior of Subject F represents TE search, from the viewpoint of the three 
stages of chunking. The behavior of the other four subjects represents the 
characteristics of searching experiments, in which they organize their behavior on 
the second and third stages of chunking. 

To clarify the discussion above, we define the compression ratio of chunking. 
The compression ratios on the first, second, and third stages of chunking are 
defined as the ratio of the number of Units to the total number of experiments, 
the ratio of the number of Series to the number of Units, and the ratio of the 
number of Blocks to the number of Series, respectively. A smaller compression 
ratio means that higher compression is achieved. FD experimental behavior 
is structured only on the first stage of chunking, where high compression 
is achieved, so the compression ratio on that stage falls below 1.0, whereas no 
chunking is performed on the second and third stages, where the compression 
ratios almost equal 1.0. In TE searching behavior, on the other hand, the 
compression ratio on every stage of chunking nearly equals 1.0. The characteristics 
of searching experiments appear on the second and third stages of chunking, 
where the compression ratio falls relatively far below 1.0. 

Fig. 6. Patterns of three stages of chunking 



Figure 7 shows the compression ratios of each of the six subjects on the 
three stages of chunking. For example, let us consider the behavior of 
Subject B depicted in Figure 4. The compression ratios of Subject B on the three 
stages of chunking were .32, .45, and .80, because the numbers of experiments, 
Units, Series, and Blocks were, as seen in Figure 4, thirty-four, eleven, five, 
and four (also see Figure 5). Figure 7 indicates that the compression ratio of 
Subject A on the first stage is the smallest, and the compression ratios on the 
second and third stages nearly equal 1.0; so the behavior of Subject A reflects 
the characteristics of FD experimental processes. Subject F forms almost no chunks 
on any stage, so the compression ratios remain relatively high; the behavior 
of Subject F therefore reflects the characteristics of TE search. For the other four 
subjects, chunking on the second or third stage, in addition to the first stage, is 
performed; so their behavior reflects the characteristics of searching experiments. 
Additionally, Figure 7 indicates that Subject D organizes his behavior on the 
second stage by expanded search out of focused factors, because the compression 
ratio on this stage is smaller than that on the third stage. On the other hand, 
Subject E organizes it on the third stage by expanded search within focused 
factors. 
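The compression ratios are simple quotients of the chunk counts, and Subject B's values can be reproduced from the counts given above. A minimal sketch (the function name is illustrative):

```python
from typing import Tuple


def compression_ratios(experiments: int, units: int,
                       series: int, blocks: int) -> Tuple[float, float, float]:
    """Compression ratios on the first, second, and third stages of chunking."""
    first = units / experiments   # Units per experiment
    second = series / units       # Series per Unit
    third = blocks / series       # Blocks per Series
    return first, second, third


if __name__ == "__main__":
    # Subject B: 34 experiments, 11 Units, 5 Series, 4 Blocks.
    first, second, third = compression_ratios(34, 11, 5, 4)
    print(f"{first:.2f} {second:.2f} {third:.2f}")  # 0.32 0.45 0.80
```

A subject whose every experiment forms its own Unit, Series, and Block would obtain 1.0 on all three stages, which corresponds to the TE pattern described in the text.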
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Fig. 7. Compression ratio of chunking 



5 Searching Behavior and Performance of Discovery 

5.1 Categorization of Final Solutions 

Next we consider the relation between the characteristics of searching behavior 
described above and the final solutions reached by each of the six subjects. Table 
3 lists the searched factors related to the final solutions of the six subjects, 
each of which is classified from the following two viewpoints. First, the solutions 
are divided into two categories from the viewpoint of their generality. Solutions 
in one category refer to the factors that affect the system’s performance while 
comparing several levels of the factors or mentioning the effects of fluctuations 
of other factors. One example is “in terms of hypothesis formation strategies, 
the combination of the specific and general strategies produces the highest 
performance regardless of fluctuations of other factors.” On the other hand, some 
subjects simply reported an individual level of the searched factors that seems to 
decide the system’s performance. One example is “in terms of hypothesis 
formation and testing strategies, when the former is the general strategy and the 
latter is the combination of the positive and negative testing strategies, the ratio 
of correct solutions is high.” We call the former type general 
solutions and the latter type specific solutions. 

As the second viewpoint, the solutions in Table 3 are also classified by their 
validity. The correctness of each solution is decided based on both knowledge 
of human discovery processes obtained from cognitive psychological 
studies using Wason’s task [3] [4] and knowledge of regularities in 
our simulator’s behavior identified in our other papers [5] [6]. We can divide the 
solutions of the six subjects into two categories from the two viewpoints 
mentioned above. One type consists of general and correct solutions, whereas the 
other consists of specific and incorrect solutions. Subject A, Subject B, and 
Subject C reached the former type of solutions, whereas Subject D, Subject E, and 
Subject F reached the latter type. 

Table 3. Categorization of subjects’ solutions 

| Subject | Factors | Generality | Validity |
|---|---|---|---|
| Subject A | AI-RH | General | Correct |
| Subject B | HF | General | Correct |
| Subject C | T | General | Correct |
| | T | Specific | Correct |
| Subject D | HT-HF | Specific | Incorrect |
| Subject E | HT-HF | Specific | Correct |
| Subject F | HT-HF | Specific | Incorrect |

5.2 Factors Deciding Subjects’ Performance 



Now we move to the relation between the characteristics of subjects’ 
behavior that were clarified in Section 4 and the solutions that each of the six 
subjects reached. Let us see Figure 7 again. On the first stage of chunking, the 
compression ratios of the subjects who reached general and correct solutions 
are smaller than those of the subjects who reached specific and incorrect solutions. 
This indicates that the subjects who reached correct and general solutions achieved 
higher compression on the first stage of chunking, whereas those who reached 
incorrect solutions did not. This suggests that even though the characteristics of 
searching experiments appear on the second and third stages of chunking, chunking 
on the basic first stage is crucial for organizing their behavior. 

Moreover, Table 4 shows the factors searched by each subject. The underlined 
factors indicate the factors related to the final solutions of each subject. The 
indexes “o”, “x”, and “-” indicate systematically searched factors, randomly 
searched factors, and factors that were not searched, respectively. Systematic 
search means that the subjects searched all levels of the factors or some 
representative levels, such as the levels that have the highest or lowest values. 
Table 4 shows that two subjects, Subject B and Subject C, who reached correct 
solutions, systematically searched two kinds of focused factors at once or conducted 
systematic search of other factors by expanded search out of focused factors. 
Subject A, even though he also reached correct solutions, did not conduct the 
expanded search. The reason is that the factors Subject A focused on, AI 
and RH, do not interact with other factors. On the other hand, every subject 
who reached incorrect solutions simply manipulated a single factor and could not 
conduct expanded search out of focused factors. Moreover, some of them failed 
to search the focused factors systematically. 

Table 4. Searched factors and subjects’ performance 

| Subject | Focused Factors | Other Searched Factors |
|---|---|---|
| Subject A | AI o; RH o; AI-RH o; HF o | - |
| Subject B | T x; T HF o; HF AI RH x; T o | HF x; AI-RH x; AI-RH-TE x; HT o |
| Subject C | T o; T HT o; HT HF o; T o; HF o | T o; TE o |
| Subject D | HT o; HT o; HF o | T x; T o |
| Subject E | ambiguous; AI-RH x; ambiguous; HF x; AI x; TE RH o | T x |
| Subject F | HT o; HF o; T x; HT o | HF x |

6 Summary and Conclusions 

In this paper, we defined experimental processes that reflect the
characteristics of both FD experiments and TE search as searching experiments,
and we analyzed, using SES, the ways in which the behavior of searching
experiments is organized. We also discussed the relation between the
characteristics of the behavior and the performance in discovering targets.
Through the analysis of subjects’ behavior on SES, we found that they organized
their behavior on three levels: Unit, Series, and Block. Chunking on the second
and third stages, constructing Series and Blocks, reflected the characteristics
of the behavior of searching experiments.

In the latter part of this paper, we clarified that subjects who reached
general and correct solutions effectively performed chunking on the first
stage, which worked as the basis for organizing searching behavior, and
systematically manipulated searched factors. One direction for future work is
to establish ways of feeding back descriptions of subjects’ experimental
processes based on SES and to discuss the educational effects of such feedback.

Molecular mechanisms of drug action are often based on the interaction of
drugs with target macromolecules, such as proteins and nucleic acids. The
formation of ligand-target complexes is typical for biologically active
compounds, including activators and inhibitors of various enzymes. Prediction
of the dissociation constant (Kd) of a protein-ligand complex is often used as
a scoring function for modeled complexes, and there are many approaches to
predicting this constant [1]. In the present work, various parameters of
protein-ligand complexes were used to predict Kd. These parameters can be
calculated quickly during the docking procedure that we usually use for
modeling complexes. Artificial feedforward neural networks (AFNNs) were used
as the mathematical approach to predicting the Kd of protein-ligand complexes.
In practice, neural networks are especially useful for classification and
function approximation problems for which a lot of training data is available.
Neural networks are often used in situations where there is not enough prior
knowledge to set the activation function, as in the case of predicting the
dissociation constant of protein-ligand complexes.

The Kd values for 83 complexes of biological molecules [2] were used in the
present work. Crystallographic coordinates of the 3D structures are available
for all of these complexes. This set of complexes was divided randomly into two
subsets: the training set includes 68 points and the test set includes 15
points. Hereinafter, Kd appears as the predicted value. The crystallographic
data for all complexes underwent preliminary processing according to a uniform
scheme using the program suite Sybyl [3]:

• rebuild hydrogen atoms in molecules; 

• remove crystallographic water molecules (except the case of HIV protease, where 
one molecule of water was accepted as an element of a ligand); 

• check and correct types of atom and bond; 

• solvate the complexes; 

• optimize the structure of complexes in the water environment. 

The following parameters were estimated for all 83 complexes:

1. The number of atoms in the target and ligand parts of the complex.

2. The energy due to electrostatic interactions [3].

3. The ratio of the surface buried in the complex to the full surface that is
accessible to water (sphere radius 1.4 Å) in the unbound molecule. These
parameters were estimated for both the ligand and target parts of the
complex [4].

4. Analogous parameters estimated using a sphere of 0.5 Å radius [4].

5. The changes in the integral parameters of hydrophilicity and lipophilicity
and the changes in the values of the hydrophilic and lipophilic areas. These
parameters were calculated using the original program, based on a molecular
lipophilicity potential [5].

Thus, we obtained 15 independent variables, which were used to design
statistical models for Kd prediction. An AFNN with one hidden layer and the
following set of activation functions ($F_1$, $F_2$, and $F_3$) was applied to
the model design: $F_1(x) = x$; $F_2(x) = \mathrm{sign}(x)\ln(1+|x|)$;
$F_3(x) = \mathrm{sign}(x)(e^{|x|}-1)$. Three models were considered: a linear
model (1N) and two non-linear models, one combining activation functions $F_2$
and $F_3$ (2N) and one combining activation functions $F_1$, $F_2$, and $F_3$
(3N). All models were constructed using the original software [6]. Neural
Network Constructor (NNC) was developed for creating non-linear models and for
solving statistical problems arising in different fields of knowledge, for
instance, biochemistry. NNC v.3.01 is free software available on the WWW. Some
statistical characteristics of the created models are shown in Table 1. In the
table, $R^2$ denotes the square of the binary correlation coefficient and MSE
the mean square error. The subscript "learn" corresponds to the tuning
process, "control" corresponds to the "leave-one-out" procedure, and "test"
corresponds to applying the created model to the test sample.
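
The three activation functions can be written down directly. The following is a minimal Python sketch, not the NNC software itself; the function names `f1`, `f2`, `f3` are illustrative, and the form of $F_3$ assumes the exponent is $|x|$, which makes $F_3$ the inverse of $F_2$.

```python
import math

def f1(x):
    """Linear activation F1(x) = x."""
    return x

def f2(x):
    """Logarithmic activation F2(x) = sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def f3(x):
    """Exponential activation F3(x) = sign(x) * (exp(|x|) - 1) (assumed form)."""
    return math.copysign(math.expm1(abs(x)), x)

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, f1(x), f2(x), f3(x))
```

Under this assumption `f3(f2(x))` returns `x`, so the two non-linear units compress and expand the input symmetrically around zero.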



Table 1. Statistical characteristics of the models.

The model        R²_learn  MSE_learn  R²_control  MSE_control  R²_test  MSE_test
---------------  --------  ---------  ----------  -----------  -------  --------
Linear model     0.634     1.098      0.571       1.193        0.686    0.947
2 neurons model  0.743     0.920      0.582       1.225        0.730    0.924
3 neurons model  0.837     0.733      0.705       0.995        0.806    0.979


As shown above, even the simple linear model gives a valid result for Kd
prediction on the test set. The 2N model gives only an insignificant
improvement in predictive ability. The 3N model is much better if we take into
account the first five indexes ($R^2_{learn}$, $MSE_{learn}$, $R^2_{control}$,
$MSE_{control}$, $R^2_{test}$). When we pay attention to the last quality index
($MSE_{test}$), we notice that the last model is not as good as would be
desirable. It should be noted that the square of the correlation coefficient on
testing is greater than the square of the correlation coefficient for the
"leave-one-out" procedure. This can be explained by the following reasons:

• inaccuracy of Kd measurement;

• inaccuracy of RSA method;

• insufficiency of the sample;

• incorrect description of atom and bond types in the force field used.

There are several ways to improve the model quality: refinement of the source
data, robust estimation of the model parameters, or an increase in the size and
variety of the training set.

1 Introduction

The identification of the start (onset) time of the quasi-periodic oscillation
(QPO) known in magnetospheric physics as the Pi 2 pulsation from ground
magnetic field observations is usually carried out by focusing on a wave-like
component obtained by applying a linear band-pass filter [5,6]. When the
background magnetic field (i.e., time-dependent mean value structure) and/or
the amplitude of high-frequency components (i.e., time-dependent variance
structure) change rapidly around the initial period of Pi 2 pulsations, any
linear band-pass filter, including procedures based on a simple modification
of the wavelet analysis, always generates a pseudo precursor prior to the true
onset time. In such a case, an accurate determination of the onset time
requires a nonlinear filter that enables us to separate only the wave-like
component associated with Pi 2 pulsations from the time-varying mean and/or
variance structures with various discontinuities. In this study we introduce a
locally fixed time series model which partitions the time series into three
segments and models each segment as a linear combination of several possible
components. An optimal partition obtained by the minimum AIC procedure allows
us to determine an onset time precisely even in the above-mentioned case. We
illustrate this procedure by showing an application to actual data sets.

2 Treatment of Rapid Decrease in Trend 

The time series $Y_{1:N} = [y_1, \ldots, y_N]$ is a scalar observation, namely
the H component recorded by a magnetometer at a ground station [8]. We
sometimes observe an extremely rapid decrease in the background magnetic field
measured at high-latitude stations. Removing such rapid changes in the trend
from the original observations beforehand enhances the efficiency and accuracy
of estimating the parameters of the time series model, because onset time
determination in our approach is based on a representation of the time series
by a flexible model with many unknown parameters. Prior to the onset
determination analysis we therefore apply a detrending procedure which fits a
parametrically described function $\mu_n(\theta)$ to $Y_{1:N}$, where $\theta$
is a parameter vector for representing $\mu_n$, which is a function of $n$.

The detrending procedure begins by examining the sequence of first differences
of the original time series and identifying intervals, each of which is defined
by consecutive data points whose first difference is smaller than a certain
threshold, $-\delta_{th}$. We denote the $j$th interval by
$D_j = [i_{j,A}, i_{j,B}]$ $(j = 1, \ldots, J)$, where $J$ is the number of
intervals with a rapid decrease. A detailed examination of $y_n$ for
$n \in D_j$ finds that a rapid decrease can be approximated by the first
quarter of the cycle of a cosine function. Specifically, the functional form
for the $j$th rapid decrease is given by

$\mu_n = (g_{j,A} - g_{j,B}) \cos\left(\frac{\pi}{2} \cdot \frac{n - i_{j,A}}{i_{j,B} - i_{j,A}}\right) + g_{j,B}$  for $n \in D_j$.  (1)

$\mu_n$ for an interval between $D_j$ and $D_{j+1}$, specified by $C_j$, is
simply given by a linear function:

$\mu_n = \left(\frac{g_{j+1,A} - g_{j,B}}{i_{j+1,A} - i_{j,B}}\right)(n - i_{j,B}) + g_{j,B}$  for $n \in C_j$,  (2)

where $C_j = (i_{j,B}, i_{j+1,A})$. $\mu_n$ for an interval before $D_1$ is
given by a constant: $\mu_n = g_{1,A}$. Similarly, for an interval after $D_J$,
i.e., $C_J = (i_{J,B}, N]$, $\mu_n$ is given by $\mu_n = g_{J,B}$.

For a given set of $D_1, \ldots, D_J$, an optimal set of $(g_{j,A}, g_{j,B})$
$(j = 1, \ldots, J)$ is easily obtained by applying a least squares fit. In
practice, a minor adjustment of the location of $D_j$ itself is also carried
out by minimizing the squared residuals. Eventually $\mu_n$ is represented with
a parameter vector $\theta$ which consists of $4J$ variables:

$\theta = [(i_{1,A}, g_{1,A}), (i_{1,B}, g_{1,B}), \ldots, (i_{J,A}, g_{J,A}), (i_{J,B}, g_{J,B})]$.  (3)

As a result, the procedure for obtaining an optimal $\theta$, $\theta^*$, turns
out to be nonlinear. The detrended signal, $e_n$, is defined by
$e_n = y_n - \mu_n(\theta^*)$ $(n = 1, \ldots, N)$.
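
The piecewise trend of (1)-(2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each rapid-decrease interval is passed as a tuple `(i_A, g_A, i_B, g_B)` (a hypothetical layout), and the cosine argument is assumed to run from 0 to π/2 across the interval, so that the trend equals `g_A` at the start and `g_B` at the end.

```python
import math

def trend(n, intervals):
    """Evaluate the piecewise trend mu_n.

    intervals: list of (i_A, g_A, i_B, g_B), sorted by position, with i_A < i_B.
    """
    if n < intervals[0][0]:
        return intervals[0][1]                      # constant before D_1
    for j, (i_a, g_a, i_b, g_b) in enumerate(intervals):
        if i_a <= n <= i_b:                         # quarter-cosine decrease on D_j
            phase = 0.5 * math.pi * (n - i_a) / (i_b - i_a)
            return (g_a - g_b) * math.cos(phase) + g_b
        if j + 1 < len(intervals):
            next_i_a, next_g_a = intervals[j + 1][0], intervals[j + 1][1]
            if i_b < n < next_i_a:                  # linear segment on C_j
                slope = (next_g_a - g_b) / (next_i_a - i_b)
                return slope * (n - i_b) + g_b
    return intervals[-1][3]                         # constant after D_J

def detrend(y, intervals):
    """Return e_n = y_n - mu_n for n = 1, ..., N (1-based time index)."""
    return [y_n - trend(n, intervals) for n, y_n in enumerate(y, start=1)]
```

The least squares fit of the levels and the adjustment of the interval locations are left out; the sketch only evaluates the trend for a given parameter set.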



3 Data Partition 

Suppose that a wave train of the Pi 2 pulsation is observed in
$E_{1:N} = [e_1, \ldots, e_N]$, and denote its starting and ending points by
$k_1 + 1$ and $k_2$, respectively. Accordingly, the total interval is divided
into three sub-intervals:

$E_{1:N} = [\underbrace{e_1, \ldots, e_{k_1}}_{I^{(1)}} \mid \underbrace{e_{k_1+1}, \ldots, e_{k_2}}_{I^{(2)}} \mid \underbrace{e_{k_2+1}, \ldots, e_N}_{I^{(3)}}]$.  (4)

The presence of the Pi 2 pulsations is assumed only for the interval $I^{(2)}$.
The Akaike Information Criterion (AIC) [1] for $E_{1:N}$, $\mathrm{AIC}_{at}$,
is given by

$\mathrm{AIC}(k_1, k_2) = \mathrm{AIC}_{at} = \mathrm{AIC}^{(1)} + \mathrm{AIC}^{(2)} + \mathrm{AIC}^{(3)}$,  (5)

which is a function of $k_1$ and $k_2$, where $\mathrm{AIC}^{(m)}$ is the AIC
for the $m$th interval [7]. The onset and offset times of the Pi 2 pulsations
are given by the optimal dividing points, $k_1^*$ and $k_2^*$, respectively,
which are determined by minimizing $\mathrm{AIC}_{at}$.
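
The search over dividing points in (5) can be illustrated with a brute-force Python sketch. As a loud simplification, each segment is scored here with a Gaussian white-noise model (mean and variance, two parameters) instead of the state space models used in this paper; `gaussian_aic`, `best_partition`, and `min_len` are hypothetical names.

```python
import numpy as np

def gaussian_aic(seg):
    """AIC of an i.i.d. Gaussian model (2 parameters) fitted to one segment."""
    n = len(seg)
    var = max(float(np.var(seg)), 1e-12)            # ML variance estimate
    loglik = -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)
    return -2.0 * loglik + 2 * 2

def best_partition(e, min_len=5):
    """Return (AIC, k1, k2) minimizing AIC(1) + AIC(2) + AIC(3)."""
    e = np.asarray(e, dtype=float)
    n = len(e)
    best = (np.inf, None, None)
    for k1 in range(min_len, n - 2 * min_len + 1):
        for k2 in range(k1 + min_len, n - min_len + 1):
            total = (gaussian_aic(e[:k1]) + gaussian_aic(e[k1:k2])
                     + gaussian_aic(e[k2:]))
            if total < best[0]:
                best = (total, k1, k2)
    return best
```

Here `e[:k1]` holds $e_1, \ldots, e_{k_1}$, so the returned `k1` and `k2` correspond to the dividing points of (4). Replacing `gaussian_aic` by a per-segment state space likelihood would give the procedure described in the text.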

4 Time Series Model for Each Segment



Suppose that the time series $e_n$ for the $m$-th interval is given by the
following observation model

$e_n = t_n^{(m)} + s_n^{(m)} + w_n^{(m)}$  ($m = 1, 2,$ and $3$),  (6)

where $t_n^{(m)}$ is a stochastic trend component and is assumed to follow a
system model [4]

$t_n^{(m)} = 2t_{n-1}^{(m)} - t_{n-2}^{(m)} + v_n^{t(m)}$,  $v_n^{t(m)} \sim N(0, \tau_t^{2(m)})$.  (7)

$w_n^{(m)}$ is the observation noise, and $s_n^{(m)}$ corresponds to the signal
associated with the Pi 2 pulsations, which is assumed to be a stochastic
process with a colored power spectrum. Obviously, $s_n^{(1)} = s_n^{(3)} = 0$.

In this study $s_n^{(2)}$ is furthermore decomposed into the quasi-periodic
oscillation (QPO) component $q_n$ and the autoregressive (AR) component $r_n$:
$s_n^{(2)} = q_n + r_n$. $q_n$ and $r_n$ are modeled by

$q_n = 2\cos(2\pi f_c)\, q_{n-1} - q_{n-2} + v_n^q$,  $v_n^q \sim N(0, \tau_q^2)$,  (8)

$r_n = \sum_{i=1}^{J_{AR}} a_i r_{n-i} + v_n^r$,  $v_n^r \sim N(0, \tau_r^2)$,  (9)

respectively. $f_c$ corresponds to the reciprocal of the period of the Pi 2
pulsations in units of data points. In this study it is treated as an unknown
parameter and need not be given beforehand. The presence of system noise in (8)
makes the cycle stochastic rather than deterministic, and thus the QPO model
allows us to represent a periodic component of distinct frequency $f_c$ with
stochastically time-varying amplitude and phase [3].

The AR component is introduced to represent the locally stationary component
in $s_n^{(2)}$. Namely, whereas $q_n$ describes a signal with a prominent peak
in the power spectrum (i.e., a line spectrum), $r_n$ accounts for a signal
having a continuous spectrum. Several trials with different values of $J_{AR}$
in applications found that simply fixing $J_{AR} = 4$ is sufficient in our
study.
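
To see how the recursions (8) and (9) behave, the two components can be simulated. The sketch below is illustrative only: the function names, the explicit initial values, and the use of standard deviations `tau_q` and `tau_r` as arguments are assumptions of this example.

```python
import numpy as np

def simulate_qpo(n, fc, tau_q, q0, q1, rng):
    """q_n = 2 cos(2 pi fc) q_{n-1} - q_{n-2} + v_n, v_n ~ N(0, tau_q^2)."""
    q = np.zeros(n)
    q[0], q[1] = q0, q1
    c = 2.0 * np.cos(2.0 * np.pi * fc)
    for i in range(2, n):
        q[i] = c * q[i - 1] - q[i - 2] + tau_q * rng.standard_normal()
    return q

def simulate_ar(n, a, tau_r, init, rng):
    """r_n = sum_i a_i r_{n-i} + v_n, v_n ~ N(0, tau_r^2); len(init) == len(a)."""
    p = len(a)
    r = np.zeros(n)
    r[:p] = init
    for i in range(p, n):
        r[i] = sum(a[k] * r[i - 1 - k] for k in range(p)) + tau_r * rng.standard_normal()
    return r
```

With `tau_q = 0` the QPO recursion reproduces a pure cosine of frequency `fc`; a positive `tau_q` lets amplitude and phase drift, which is the stochastic cycle described above.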



5 Parameter Estimation Procedure 

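The time series model presented in the previous section can be formulated as a
state space model (SSM) [2] as follows:

$x_n^{(m)} = F^{(m)} x_{n-1}^{(m)} + G^{(m)} v_n^{(m)}$,  (10)

$e_n = H^{(m)} x_n^{(m)} + w_n^{(m)}$,  (11)

where $x_n^{(m)}$ is the state vector. For example, the time series model for
the second interval can be represented by the SSM in which the corresponding
vectors and matrices are

$H^{(2)} = [1, 0, 1, 0, 1, 0, 0, 0]$,

$F^{(2)} = \begin{pmatrix}
2 & -1 & & & & & & \\
1 & & & & & & & \\
& & C & -1 & & & & \\
& & 1 & & & & & \\
& & & & a_1 & a_2 & a_3 & a_4 \\
& & & & 1 & & & \\
& & & & & 1 & & \\
& & & & & & 1 &
\end{pmatrix}$,
$G^{(2)} = \begin{pmatrix}
1 & & \\
& & \\
& 1 & \\
& & \\
& & 1 \\
& & \\
& & \\
& &
\end{pmatrix}$,
$v_n^{(2)} = \begin{pmatrix} v_n^{t(2)} \\ v_n^q \\ v_n^r \end{pmatrix}$,

where $C = 2\cos(2\pi f_c)$. Here the empty entries of $F^{(2)}$ and $G^{(2)}$
are all zero and $v_n^{(2)} \sim N(0, R^{(2)})$ with a diagonal variance matrix
$R^{(2)} = \mathrm{diag}(\tau_t^{2(2)}, \tau_q^2, \tau_r^2)$.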
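An optimal estimate of $t_n$, $q_n$, and $r_n$ is given by the estimated
$x_n^{(2)}$, which is obtained by the Kalman filter and smoother [2]. Here
$\tau_t^{2(2)}$, $\tau_q^2$, and $\tau_r^2$ are unknown parameters to be
optimized. The time series model for the second interval then involves nine
unknown parameters:

$\lambda^{(2)} = [\sigma^{2(2)}, \tau_t^{2(2)}, f_c, \tau_q^2, \tau_r^2, a_1, a_2, a_3, a_4]'$.  (12)

The optimal $\lambda^{(2)}$, $\lambda^{(2)*}$, can be determined by maximizing
the log-likelihood,
$\ell(\lambda^{(2)}) = \log p(E_{k_1+1:k_2} \mid \lambda^{(2)})$, where
$E_{k_1+1:k_2} = [e_{k_1+1}, \ldots, e_{k_2}]$ [4]. The AIC value for the
second interval, $\mathrm{AIC}^{(2)}$, is then defined by

$\mathrm{AIC}^{(2)} = -2\ell(\lambda^{(2)*}) + 2\dim(\lambda^{(2)})$.  (13)

$\mathrm{AIC}^{(1)}$ and $\mathrm{AIC}^{(3)}$ in (5) are defined similarly.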
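
As an illustration of how the log-likelihood and AIC of the second segment might be evaluated for one fixed parameter vector, the following Python sketch builds the matrices of the SSM and runs a plain Kalman filter. Several things here are assumptions of the example rather than statements about the authors' code: the state ordering (trend, QPO, AR(4)), the large diffuse initial covariance `p0`, and all function names. The numerical optimization over the nine parameters is not shown.

```python
import numpy as np

def build_ssm(fc, a):
    """System matrices for trend + QPO + AR(4), assumed 8-dimensional state."""
    c = 2.0 * np.cos(2.0 * np.pi * fc)
    F = np.zeros((8, 8))
    F[0, :2] = [2.0, -1.0]; F[1, 0] = 1.0           # second-order trend block
    F[2, 2:4] = [c, -1.0];  F[3, 2] = 1.0           # QPO block
    F[4, 4:8] = a                                   # AR(4) block
    F[5, 4] = F[6, 5] = F[7, 6] = 1.0
    G = np.zeros((8, 3))
    G[0, 0] = G[2, 1] = G[4, 2] = 1.0               # noise enters t_n, q_n, r_n
    H = np.array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])
    return F, G, H

def kalman_loglik(e, F, G, H, R, sigma2, p0=1e4):
    """Gaussian log-likelihood of e via the Kalman filter (prediction errors)."""
    d = F.shape[0]
    x, P = np.zeros(d), p0 * np.eye(d)              # assumed diffuse-like start
    Q = G @ R @ G.T
    loglik = 0.0
    for y in e:
        x, P = F @ x, F @ P @ F.T + Q               # one-step prediction
        v = y - (H @ x)[0]                          # innovation
        s = (H @ P @ H.T)[0, 0] + sigma2            # innovation variance
        K = (P @ H.T)[:, 0] / s                     # Kalman gain
        x = x + K * v
        P = P - np.outer(K, H @ P)
        loglik += -0.5 * (np.log(2.0 * np.pi * s) + v * v / s)
    return loglik

def aic(loglik, n_params):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2 * n_params
```

In a full implementation `kalman_loglik` would be wrapped in a numerical optimizer, and `aic(loglik, 9)` evaluated at the optimum would play the role of (13).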



6 Result and Summary 

Fig. 1 shows one result of the decomposition obtained by applying our
procedure to data sets in each of which a typical Pi 2 pulsation is observed.
The data set examined here is the H component measured at Kotel’ney (Russia)
on 26 May 1996, 16:10:00-17:10:00. The sampling interval is one second, and
thus $N = 3{,}600$. The two vertical lines indicate the estimated $k_1^*$ and
$k_2^*$, respectively. The three lines are the estimated $r_n$, $q_n$, and the
original observation $y_n$, from top to bottom, respectively. The horizontal
arrow illustrates that the minimum $\mathrm{AIC}_{at}$ procedure finds the
optimal $k_1^*$. The thick line is the estimated trend component,
$\mu_n + t_n$. In this case, three rapid decreases are identified; namely,
$J = 3$.

Fig. 1. Decomposition of the magnetic field data.



The advantages of applying our procedure are summarized as follows. First,
our model for decomposition is robust to a rapid change in trend, and thus
gives a good separation of the Pi 2 wave component. Second, the onset time can
be objectively determined by minimizing an information criterion,
$\mathrm{AIC}_{at}$. Our method is therefore free from ambiguity in onset time
determination. Finally, our procedure is fully automatic.

Acknowledgments. We thank all members of the 210° magnetic meridian
network project (P.I. Prof. Yumoto, Kyushu Univ.). The author thanks Mr.
Uozumi for his help in selecting the data sets.

1 Introduction

In discovery with real data, one is always working with approximations. Even 
without noise, one computes over a finite set of unequally spaced numbers, ap- 
proximating one’s values with this set. With noise, the values are even more 
uncertain. Along with the ambiguity of the exact values of numbers, there exists 
Occam’s razor to prefer simpler equations. That is, if a straight line will explain 
the data, then that is generally thought preferable to equations of higher power. 
It should be noted that this preference is a choice, however, and does not always 
work [8]. The preference needs to be codified to be automated, but codifying 
simplicity is not straightforward [8]. Finally, there needs to be a quantitative 
way to measure the goodness of an equation after it is chosen. Function finding 
is numeric induction, so there can never be certainty. But assigning a value to 
the inductive support provides a way to compare results across tasks. Thus
discovery of functional forms can be divided into three tasks: (1) choosing a
search technique to find the set of best equations within the limitations of
finite precision arithmetic and noise; (2) choosing from among the best
equations based on some criterion that encodes preference; and (3) choosing a
metric for the inductive support of the found equation.



2 Choices 

I. Choice of Search Technique: Equation Signatures 

One way to search for equations is with regression, but this confounds 
the ambiguity of the coefficients with the ambiguity of the type. However, if the 
equation type is known, finding the coefficients is an already solved problem. So 
it is sufficient to search for the type. The method to search for types is based on 
equation signatures [3] [4]. Definition Equation Signature: an equation
signature is a property of an equation type that is independent of its
coefficients and that can be used to identify its type uniquely. It should be
noted that searching for types using equation signatures is less error prone
not only because only one thing is being searched for at a time, but because
signatures are independent of each other and can therefore be productively
combined. Some signatures are [3] [4]:

1.0 Linear Equations The signature of a linear equation measures flatness.
The general linear equation is a hyperplane, which has the property that a unit
normal constructed on one side of the plane is equal to a unit normal
constructed anywhere else on the same side of the plane. So the variance in the
unit normal provides the signature. It is a metric that tells one how close the
data is to being a linear equation in whatever dimensional space one is working
in. It also makes it easier to mitigate the impact of noise by choosing points
far apart for the computation of the unit normal.
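
A minimal two-dimensional sketch of this signature follows. It is an illustration under assumptions, not the implementation of [3] [4]: normals are computed from pairs of points a fixed number of positions apart after sorting by x (the `gap` parameter is hypothetical), the points are assumed to have distinct x values, and the signature is taken as the summed variance of the normal's components.

```python
import numpy as np

def unit_normal_variance(points, gap=None):
    """Variance of unit normals of chords between far-apart points (2-D)."""
    pts = np.asarray(points, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]                # sort so chords point in +x
    n = len(pts)
    gap = gap or max(1, n // 2)                     # pair points far apart
    d = pts[gap:] - pts[:-gap]                      # chord direction vectors
    normals = np.stack([-d[:, 1], d[:, 0]], axis=1) # rotate by 90 degrees
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    return float(np.sum(np.var(normals, axis=0)))
```

For exactly collinear points every chord has the same normal, so the value is zero up to rounding; curved data give a clearly positive value.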

2.0 Quadratic Equations The general equation of the second degree in two
dimensions is a conic section, which comprises the circle, the ellipse, the
hyperbola, the parabola, and degenerate forms. In its most general form, the
equation of the conic section is $ax^2 + 2hxy + by^2 + 2dx + 2ey + f = 0$,
where the more complicated versions of the equation occur when the conic
section is rotated and/or translated from the origin. Two example signatures
for 2-D quadratics:
1.0 j-invariant

A pencil of lines consists of all lines that can be drawn through a given point. 
The anharmonic or cross ratio of a pencil of lines, whose sides pass through four 
fixed points of a conic, and whose vertex is any variable point of it, is constant 
[10]. Thus, if V is any point on a conic and A, B, C, D are any four other points 
on a conic section, as V is moved around the conic section and A,B,C,D stay 
fixed, the ratio stays the same. See Figure 1. Any five points may be chosen 




Fig. 1. The anharmonic or cross ratio of V1 relative to A, B, C, D is equal to
the anharmonic ratio of V2 relative to A, B, C, D.



for the calculation. For example, they may come from different branches of a
hyperbola. This ratio can be computed in terms of distances or slopes [1]. If V
is a point on a conic and A, B, C, D are four other points on the conic
section, and $\alpha$ is the slope of VA, $\beta$ is the slope of VB, $\gamma$
is the slope of VC, and $\delta$ is the slope of VD, then the anharmonic or
cross ratio $\sigma$ is:
$\sigma = \frac{(\alpha-\gamma)(\beta-\delta)}{(\alpha-\delta)(\beta-\gamma)}$.
While the ratio is constant, it is dependent on the order in which the four
fixed points are taken. Depending on the order of selection of the points, the
ratio can be calculated in twenty-four ways to yield six different ratios [1].
Fortunately, there is a function of the anharmonic ratio that is independent of
the order of the points. This is the j-invariant [7], which is
$j(\sigma) = \frac{((\sigma - 1/2)^2 + 3/4)^3}{((\sigma - 1/2)^2 - 1/4)^2}$.
Once the four points have been selected, this is constant and remains the same
constant under any translation or rotation of the axes. If a different set of
four points is selected, it is also constant but takes on a different value.
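
Both properties can be checked numerically. The Python sketch below assumes one common convention for the cross ratio of four slopes and the usual j-function of a cross ratio (written without any constant prefactor); the helper names are hypothetical, and none of the lines involved may be vertical.

```python
import math
from itertools import permutations

def slopes_from(v, pts):
    """Slopes of the lines joining vertex v to each point in pts."""
    return [(y - v[1]) / (x - v[0]) for x, y in pts]

def cross_ratio(a, b, c, d):
    """Cross ratio of four slopes (one conventional ordering, assumed here)."""
    return ((a - c) * (b - d)) / ((a - d) * (b - c))

def j_invariant(s):
    """Order-independent function of the cross ratio s."""
    u = (s - 0.5) ** 2
    return (u + 0.75) ** 3 / (u - 0.25) ** 2

if __name__ == "__main__":
    pts = [(x, x * x) for x in (-1.0, 0.5, 2.0, 4.0)]   # four points on y = x^2
    for v in [(3.0, 9.0), (-2.5, 6.25)]:                # two vertices on the conic
        print(cross_ratio(*slopes_from(v, pts)))
    print({round(j_invariant(cross_ratio(*p)), 9)
           for p in permutations(slopes_from((3.0, 9.0), pts))})
```

The cross ratio does not change when the vertex moves along the parabola, and all twenty-four orderings of the four points give the same j value.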

2.0 Linearity of the parallel chords 
As first proved by Apollonius in the third century B.C. [5], a conic section
has the property that the midpoints of any set of parallel chords through it 
are collinear [12]. So the collinearity of the midpoints of parallel chords is a 
signature of the general equation of the second degree in two variables. The test 
for collinearity can utilize the signature of the linear equation defined above. It 
is unlikely that a given set of points will be lined up so that a set of parallel 
chords will have both points of intersection with the conic section on each chord. 
However, it is possible to pick a direction and create parallel lines through each 
of the other points in the data set. Then it is only necessary to know the other 
point of intersection with the conic section to enable calculation of the midpoint. 
Pascal’s hexagon theorem run backwards [4] is used to find the other point of 
intersection. Then calculation of the midpoint of each chord is straightforward. 
Thus the algorithm to determine if a point set is a conic section is as follows: 

Pick two points from the dataset; compute their slope and calculate their 
midpoint. 

Create parallel chords through the other points in the dataset. 

Run Pascal’s hexagram construction backwards to obtain the other point 
of intersection of each chord with the conic section. 

Compute the midpoint of each chord. 

Use the linear equation signature to determine if these points are collinear. 

Figure 2 is an example of this using case 105 from the function-finding dataset [11]. 




Fig. 2. Midpoints of a system of parallel chords for case 105 of the function-finding 
dataset [11]. 



More signatures for quadratics in two and higher dimensions are given in [4].
One particularly nice attribute of equation signatures is that they make
possible the search for higher order equations with lower order methods, e.g.
straight lines [4]. Another nice feature of this method is that one can define
measures that mean the same thing from one problem to another. Thus it becomes
possible to list the set of good equations. This is important because noise
may make it impossible to determine unambiguously the best equation. This leads
to the necessity of the next choice.

II. Choice of Equation: a Priori Ordering There are many ways to select an
equation from a group of equations. Dorothy Wrinch and Harold Jeffreys [13]
argue for an a priori ordering of equations. According to them, a simple law is
a priori more probable than a complex law. They do not require the ordering to
be strictly monotonic; it can have branches and loops. Their concept of an
explicit choice mechanism has the advantage of showing clearly the preference
of one equation over another. Ordering all the equations of physics is a
daunting task. But it is tractable to create an ordering within a given class
of equations. This is the approach suggested here. The functions are divided
into a number of generic types. Within each type the equations are grouped into
equivalence classes. Take the simple generic type: $y^{k_1} = ax^{k_2} + b$.
The goal is to define an ordering based solely on the values of $k_1$ and $k_2$
that would cover all possibilities and would be intuitively acceptable. Let the
exponents be expressed by fractions $j/l$ where $j$ and $l$ are non-negative
integers, with $j$ and $l$ relatively prime. Then let the sequence be the
non-decreasing series of $j/l$ where the value of the sequence at any point is
equal to the sum of the numerator and denominator, i.e. $j + l$; and within
each equal value of the sequence, the $j/l$ are ordered by increasing $l$. The
first few terms of this series are: 0/1, 1/1, 2/1, 1/2, 3/1, .... Their
corresponding sequence values are: 1, 2, 3, 3, 4, .... To find the sequence
value of any rational number, reduce it until the numerator and denominator are
relatively prime, then add the numerator and denominator. The result is the
sequence value of the number. The simplicity value (S-value) of an equation
will now be defined to be the sum of the sequence values of its exponents after
the original exponents have been converted to positive whole numbers. Where
composite terms exist, e.g. $xy$, sum the values of the exponents, then compute
the S-value of the result. Is this a reasonable grading? Qualitatively yes.
Every equation can appear since every rational number can be expressed this
way. Lower degree equations have lower values than higher degree equations.
More complex equations have higher values. Now it is possible to compare
equations by their S-values. S-value differences of zero or one may not mean
much, but larger differences clearly do. This is an ordering for one class of
equations. Others are possible. The important point is that there is some
predefined ordering. For more complex classes of equations, other encodings are
needed. Once there is an ordering, the best can be chosen, but support for it
needs to be determined, which leads to the next choice.
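
The sequence value and the resulting ordering are easy to compute with exact rational arithmetic. The following Python sketch is one reading of the definition above; the function names are illustrative, and `s_value` assumes its argument already lists the exponents in the converted form described in the text.

```python
from fractions import Fraction

def sequence_value(q):
    """Numerator plus denominator of q in lowest terms."""
    q = Fraction(q)                                 # Fraction reduces automatically
    return q.numerator + q.denominator

def ordering_key(q):
    """Sort key: by sequence value, then by increasing denominator."""
    q = Fraction(q)
    return (q.numerator + q.denominator, q.denominator)

def s_value(exponents):
    """Simplicity value: sum of the sequence values of the given exponents."""
    return sum(sequence_value(e) for e in exponents)

if __name__ == "__main__":
    terms = [Fraction(3), Fraction(1, 2), Fraction(0), Fraction(2), Fraction(1)]
    ordered = sorted(terms, key=ordering_key)
    print([str(t) for t in ordered])                # 0, 1, 2, 1/2, 3
    print([sequence_value(t) for t in ordered])     # 1, 2, 3, 3, 4
```

For the straight line $y = ax + b$ both exponents are 1, so `s_value([1, 1])` gives 4 under this reading.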

III. Choice of Evidential Support Method: Concinnity 

There needs to be a way to quantitatively measure the goodness of a given
equation. This is numeric induction, so there can never be certainty. The
eliminative and variative method originally proposed by Francis Bacon [9] [2]
is used here. This method uses a series of tests of great variety that are
designed to eliminate possibilities. As more and more tests of different types
are passed, the result becomes harder to refute. After testing, theories can be
graded ordinally, by the number of tests they passed. Define: Concinnity Index
This is the number of tests passed by the equation, out of the number of tests
performed. This index is not a single number, and cannot be if one wants to
emphasize its dependence on the number of tests passed. An individual test
gauges support for an equation just according to the property the particular
test measures. By combining many tests, a harmonious picture of the validity of
the equation emerges. This is also a way to say how far the chosen equation is
from the next best. If the best can pass ten tests while the next best can only
pass one, this provides a measure of the distance between the equations.

Component tests of the Concinnity Index include two types of tests: tests 
for randomness of residuals, and other tests based on the equation type. Due to 
much study of random number generators, there are many tests for randomness 
[6]. Tests include the serial correlation test, a binning test, a run test. Tests 
based on equation type deal with signatures. If the equation was not found by 
the signature method, signatures provide a good test. Signatures can also be 
combined to make a stronger test. For example, the variables in an equation 
can be linearized and the resulting points tested for linearity. It should be noted 
that all of these tests can and should also be performed for points outside the 
interval that the equation was found in. Additionally, point prediction outside 
the interval is a crucial test. 
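
A minimal sketch of such an index, assuming each component test is a function that takes the residuals and returns pass or fail. The two randomness tests shown (a lag-1 serial correlation check and a sign runs test, both with a two-standard-deviation threshold) are common textbook forms chosen for illustration; the thresholds and function names are assumptions of this example.

```python
import math

def serial_correlation_test(res):
    """Pass if the lag-1 autocorrelation is within 2/sqrt(n) of zero."""
    n = len(res)
    m = sum(res) / n
    den = sum((x - m) ** 2 for x in res)
    num = sum((res[i] - m) * (res[i + 1] - m) for i in range(n - 1))
    return abs(num / den) <= 2.0 / math.sqrt(n)

def runs_test(res):
    """Pass if the number of sign runs is within 2 std of its expectation."""
    signs = [x >= 0 for x in res]
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(signs[i] != signs[i + 1] for i in range(len(signs) - 1))
    mean = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return abs(runs - mean) <= 2.0 * math.sqrt(var)

def concinnity_index(res, tests):
    """Return (tests passed, tests performed) for the given residuals."""
    return sum(1 for t in tests if t(res)), len(tests)
```

Residuals that still contain a trend, such as a linear ramp, fail both tests, whereas residuals without visible structure pass them; signature-based tests could be appended to the `tests` list in the same way.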

In summary, choice has been present at every stage of this process. 

Recent advances in spectroscopic instruments have allowed us to obtain a large
amount of spectral data in machine-readable form. High-resolution molecular
spectra contain abundant information on the structures and dynamics of
molecules. However, extraction of such useful information necessitates a
procedure of spectral assignment, in which each spectral line is assigned a set
of quantum numbers. This procedure has traditionally been performed by making
use of regular patterns that are obviously seen in the observed spectrum.
However, we often encounter complex spectra in which such regular patterns may
not be readily discerned. The purpose of the present work is to search for new
methods which can assist in assigning such complex molecular spectra. We wish
to devise computer-aided techniques for picking out regular patterns buried in
a list of observed frequencies which appear to be randomly distributed. We hope
to rely on the great computational power of modern computers.

Previously [1,2], we proposed a method, which we tentatively refer to as the
“second difference method”, and suggested that this technique may be developed
into a useful tool for the analysis of complex molecular spectra. This method
has been tested with success on the observed spectrum of a linear molecule,
DCCCl [1]. We have also presented a further test using artificial data
corresponding to an infrared spectrum of a linear molecule, HCCBr [2]. However,
we recently encountered a set of data for which the method in its original form
did not work well.

The present article describes a revised algorithm, which was developed as a
remedy but proved to be a substantial improvement over the original one. The
revised algorithm is based on the same basic assumptions as the original one.

1) A complex spectrum is formed as a result of overlap of many spectral 
series, each of which has a simple structure. 

2) The frequencies of spectral lines belonging to a series may be repre- 
sented to a good approximation by a quadratic function of a running 
number. 

The second assumption means that if we let $f(k)$, $k = 1, 2, 3, \ldots$ be
the frequencies of spectral lines belonging to a series, the second difference

$\Delta^2(k) = f(k+2) - 2f(k+1) + f(k)$  (1)

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 252-254, 2000. 
would be almost constant independent of $k$, and therefore the third difference

$\Delta^3(k) = f(k+3) - 3f(k+2) + 3f(k+1) - f(k)$  (2)

would be very small for all $k$.
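
The two differences are simple to compute for a list of frequencies. The short Python sketch below (function names are illustrative) shows that for an exactly quadratic series the second difference is constant and the third difference vanishes.

```python
def second_diff(f):
    """Second differences f(k+2) - 2 f(k+1) + f(k) of a sequence."""
    return [f[k + 2] - 2 * f[k + 1] + f[k] for k in range(len(f) - 2)]

def third_diff(f):
    """Third differences f(k+3) - 3 f(k+2) + 3 f(k+1) - f(k) of a sequence."""
    return [f[k + 3] - 3 * f[k + 2] + 3 * f[k + 1] - f[k]
            for k in range(len(f) - 3)]

if __name__ == "__main__":
    series = [3 * k * k + 2 * k + 7 for k in range(1, 9)]  # exactly quadratic
    print(second_diff(series))   # constant: twice the quadratic coefficient
    print(third_diff(series))    # all zeros
```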

We coded a FORTRAN program, briefly described as follows:

(1) A list of observed frequencies is read in.

(2) 3-membered chains are generated. A 3-membered chain is defined
as an array of three frequencies $(f_1, f_2, f_3)$ chosen from the list of
observed frequencies. To save computation time and memory, the
upper and lower limits of the second difference $f_3 - 2f_2 + f_1$ as well
as those of the first difference $f_2 - f_1$ may be given as preset values
to restrict the 3-membered chains.

(3) From $n$-membered chains, $(n+1)$-membered chains are generated. The
method for this is discussed below in some detail.

(4) Step (3) is repeated until the chain length reaches a preset value.

(5) For each chain with the preset length, the frequencies are least-squares
fitted to a polynomial of a preset order, and the standard deviation is
calculated.

(6) Chains and their standard deviations are listed in order of
ascending standard deviation.

In (3), the method to extend an n-membered chain $(f_1, \ldots, f_{n-2}, f_{n-1}, f_n)$ 
to an (n+1)-membered chain is as follows. We calculate 

$f_{\mathrm{pred}} = f_{n-2} - 3f_{n-1} + 3f_n$. (3) 

If we find in the list of observed frequencies a frequency which falls between 
$f_{\mathrm{pred}} - \Delta f_{\mathrm{allow}}$ and $f_{\mathrm{pred}} + \Delta f_{\mathrm{allow}}$, it is added to the tail of the n-membered 
chain to generate an (n+1)-membered chain. This method increases the chain 
length by one in such a way that the second differences $f_n - 2f_{n-1} + f_{n-2}$ and 
$f_{n+1} - 2f_n + f_{n-1}$ do not differ by more than $\Delta f_{\mathrm{allow}}$, which is a preset value. 
The $\Delta f_{\mathrm{allow}}$ value will be set to such a magnitude that in a true spectral series 
$f(1), f(2), \ldots$ the absolute value of the third difference 

$\Delta^3 f(k) = f(k+3) - 3f(k+2) + 3f(k+1) - f(k)$ (4) 

for any $k$ would not exceed $\Delta f_{\mathrm{allow}}$. 
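
Steps (2) to (4) can be sketched as follows. This is a minimal Python illustration, not the FORTRAN program described above: the function names, the integer test frequencies, and the limits are all assumptions chosen for the example, and the least-squares ranking of steps (5) and (6) is omitted.

```python
def seed_chains(freqs, first_limits, second_limits):
    """Step (2): 3-membered chains (f1, f2, f3) whose first difference f2 - f1
    and second difference f3 - 2*f2 + f1 lie within the preset (low, high) limits."""
    chains = []
    for f1 in freqs:
        for f2 in freqs:
            if not first_limits[0] <= f2 - f1 <= first_limits[1]:
                continue
            for f3 in freqs:
                if second_limits[0] <= f3 - 2 * f2 + f1 <= second_limits[1]:
                    chains.append((f1, f2, f3))
    return chains


def extend_chains(chains, freqs, allow):
    """Step (3): append every observed frequency lying within +/- allow of
    f_pred = f[n-2] - 3*f[n-1] + 3*f[n] (equation (3))."""
    extended = []
    for chain in chains:
        f_pred = chain[-3] - 3 * chain[-2] + 3 * chain[-1]
        for f in freqs:
            if f_pred - allow <= f <= f_pred + allow:
                extended.append(chain + (f,))
    return extended


def build_chains(freqs, first_limits, second_limits, allow, length):
    """Step (4): repeat the extension until the preset chain length is reached."""
    chains = seed_chains(freqs, first_limits, second_limits)
    while chains and len(chains[0]) < length:
        chains = extend_chains(chains, freqs, allow)
    return chains


# Hypothetical line list: one quadratic series 1000 + 100*k - 2*k^2 (k = 0..5)
# mixed with three unrelated lines (1050, 1237, 1400).
lines = [1000, 1050, 1098, 1192, 1237, 1282, 1368, 1400, 1450]
found = build_chains(lines, first_limits=(80, 110), second_limits=(-6, -2),
                     allow=1, length=6)
```

Chains that cannot be extended are simply dropped, so only chains of the requested length survive.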

The spectrum for which the second difference method in the original form was 
unsuccessful is that of trans-glyoxal. It has been observed by Professor Kato's 
group at Kobe University by means of Doppler-free two-photon absorption 
spectroscopy at a very high resolution [3]. The profile of the spectrum is as follows. 
It consists of about 1700 spectral lines distributed in the region 22187-22215 
cm⁻¹. The lines belong to 23 major series, each containing more than 50 lines, 
as well as to a large number of minor series. 
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We applied the present method to this spectrum under the following con- 
ditions. We selected 3-membered chains for which first and second differences 
were between 0 and 1.2 cm⁻¹ and between -0.015 and 0 cm⁻¹, respectively. 
The $\Delta f_{\mathrm{allow}}$ value was set to 0.003 cm⁻¹. The chain length was preset at 15. 
The frequencies of the resulting chains were least-squares fitted to polynomials 
of the 4th order. Then, 7084 chains were obtained with standard deviations of 
0.000070 cm⁻¹ at the smallest and 0.006340 cm⁻¹ at the largest. We inspected 
the 240 15-membered chains with the smallest standard deviations to see whether 
each chain was a part of a true spectral series, and found that 130 of them 
corresponded to true spectral series. There were 75 chains for which only one line was 
erroneously merged. Similarly, 28, 5, and 2 chains had 2, 3, and 4 erroneously 
merged lines, respectively. 

Except for a few cases, it was found that the number of the erroneously 
merged lines was at most two. Even if two erroneous lines are merged into a 
chain, it would not be too difficult to arrive at the correct assignment starting 
from that chain. Out of 23 major spectral series, 19 were detected by the chains 
with fewer than 3 erroneous lines among the 240 chains. These results indicate the 
usefulness of the revised algorithm. 
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1 Introduction 

Akebono is a Japanese scientific satellite, which was launched in 1989 for precise 
observations of the magnetosphere. Akebono has been operated successfully for more 
than 11 years, and the accumulated data amount to about 1 Tbyte of digital 
data and about 20,000 audiotapes of analogue data. The VLF instruments 
onboard the Akebono satellite are designed to investigate plasma waves from a few 
Hz to 17.8 kHz [1,2]. In the field of conventional geophysics, research is mainly 
performed in one of the following ways: (1) a scientist discovers an observation 
result as evidence of a new theory, or (2) a scientist tries to explain a very unique 
observation result theoretically. In the analysis of physical phenomena from 
data observed by one satellite, it is quite difficult to tell whether we see a 
temporal change, a spatial change or a mixture of them. Hence it is indispensable for 
investigating these phenomena to examine as large an amount of data as possible, 
although they can be clarified to some extent by event study. Our aim is to 
develop new computational techniques for extracting the attributes of the plasma 
waves from the enormous data sets of Akebono and to discover epoch-making 
knowledge. 



2 Extraction of the Attributes of Plasma Waves 

In the past eleven years, many kinds of plasma waves have been observed in the 
Earth's magnetosphere by the Akebono satellite. Some of them are artificial 
waves propagating from the ground and the others are natural waves generated 
in the space plasma. As the spectrum of each wave is attributed to various 
generation mechanisms and propagation modes, the generation/propagation 
mechanisms of these waves reflect the plasma environment around the Earth, which 
depends on altitude, latitude, and a variety of geophysical parameters such as 
solar activity, geomagnetic activity, season, and local time. For example, 
electrostatic broadband low frequency noise is one of the wave phenomena frequently 
observed in the auroral region. Using all wave data obtained by Akebono for 
nine years, we obtained many new findings on the phenomenon: the wave is 
distributed in a limited latitude region, the region extends toward 
lower latitudes when the geomagnetic activity is higher, and the intensity 
becomes largest in winter and weakest in summer [3]. These findings show that 
investigating these waves and clarifying their characteristics is quite valuable 
in deriving a dynamic structure of the magnetosphere. As the total amount of the 
data obtained by Akebono is huge, it is necessary to develop a computational 
technique that efficiently classifies the wave spectra. We made an attempt to 
classify the plasma waves in a systematic way. 

Assuming that the wave intensity of each wave mode is represented by a 
function of multi-dimensional parameters, we first examine the occurrence 
frequencies of the wave intensity versus principal parameters. In Fig. 1(a), we show 
an example of a contour map which is used for the classification of wave modes. In 
the figure, the vertical axis indicates wave intensity and the horizontal axis 
indicates invariant latitude of the observation point, and the occurrence frequency 
of the wave intensity at 5.62 kHz is represented by contour level. We find that 
the distribution of the occurrence frequency is divided into several clusters, 
which correspond to different kinds of wave modes such as auroral hiss, 
chorus, magnetospheric hiss etc., which are typical wave phenomena in the 
frequency range around several kHz. We identify the wave mode for each cluster by 
sampling some data included in it. Using this method, we successfully extracted 
several kinds of wave modes. 
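
The occurrence-frequency map underlying such a contour plot can be sketched as a two-dimensional histogram. The sample values, bin widths, and the function name below are hypothetical and serve only to illustrate the counting step; they are not taken from the Akebono data.

```python
from collections import Counter


def occurrence_map(samples, lat_step, int_step):
    """Count samples per (invariant-latitude bin, intensity bin).

    samples is an iterable of (latitude, intensity) pairs; the bin index is
    the floor of the value divided by the bin width."""
    counts = Counter()
    for latitude, intensity in samples:
        key = (int(latitude // lat_step), int(intensity // int_step))
        counts[key] += 1
    return counts


# Hypothetical observations: (invariant latitude in degrees, intensity in dB).
samples = [(61.0, -102.0), (62.5, -108.0), (63.0, -101.0), (71.0, -85.0)]
counts = occurrence_map(samples, lat_step=5, int_step=10)
densest_bin, densest_count = counts.most_common(1)[0]
```

Clusters in the resulting counts would then be inspected by sampling the data that fall into each of them, as described above.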

As a next step, the parameter dependence of the spatial and temporal 
distribution of the plasma waves extracted from more than 27,000 data sets is examined 
statistically. Fig. 1(b) shows the spatial distribution of chorus emission extracted 
by this method. In the figure, the radius of the circle indicates the radial distance 
from the Earth's center and magnetic local time is taken along the 
circumference. We find that chorus emission is usually observed at radial distances 
larger than 4 $R_E$, where $R_E$ is the radius of the Earth, in the magnetic local time 
range from 0 to 18. The distribution becomes broader and farther in the dayside 
than in the nightside. This result is quite an interesting finding from a geophysical 
point of view. 



3 Discussion 

There are a variety of plasma waves, and the multi-dimensional parameter 
dependence of these waves reflects the geophysical rules that control the plasma 
environment around the Earth. Data sets continuously obtained by Akebono 
for more than eleven years contain a lot of information from which to discover 
these rules, whereas the increase in the volume of data sets makes it impossible 
to apply the conventional analysis techniques. In the present paper, we attempt to 
extract particular phenomena in a systematic way using the occurrence frequency 
of the wave intensity. We select the principal parameters arbitrarily from a scientific 
point of view, and we found that our approach is generally satisfactory. However, 
it is still at a preliminary stage and there are two important items to be 
developed in future work: an algorithm for clustering the phenomena by 
using the information of the wave spectra, and an auto-compression algorithm 
of unnecessary dimensions among the many parameters. 

References 

1. Kimura, I., Hashimoto, K., Nagano, I., Okada, T., Yamamoto, M., Yoshino, T., 
Matsumoto, H., Ejiri, M., and Hayashi, K.: VLF Observations by the Akebono 
(EXOS-D) Satellite, J. Geomagn. Geoelectr., 42, (1990), 459-478 

2. Hashimoto, K., Nagano, I., Yamamoto, M., Okada, T., Kimura, I., Matsumoto, H., 
and Oki, H.: EXOS-D (AKEBONO) Very Low Frequency Plasma Wave Instruments 
(VLF), IEEE Trans. Geosci. and Remote Sensing, 35, (1997), 278-286 

3. Kasahara, Y., Hosoda, T., Mukai, T., Watanabe, S., Kimura, I., Nakano, T., and 
Niitsu, R.: ELF/VLF Waves Correlated with Transversely Accelerated Ions in the 
Auroral Region Observed by Akebono, submitted to J. Geophys. Res., (2000) 




Fig. 1. (a) Classification of wave modes and (b) an example of spatial distribution of 
extracted wave modes. 
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1 Introduction 

Investigation of the space environment around the Earth has become a critical issue. 
In addition to the artificial waves propagating from the ground, a variety of 
natural plasma waves generated in the magnetosphere are detected by scientific 
satellites. As the characteristics of these waves reflect the plasma environment 
around the earth, it is important to measure the propagation (wave normal) 
directions of these waves. Since signals observed by a satellite may be a mixture 
of the waves from multiple directions, we must assume unknown parameters of 
these waves without prior information. In the present study, we propose a new 
direction finding method derived from the concept of energy function. 



2 WDF Method with a Gaussian Distribution Model 



The wave distribution function (WDF) is derived from the concept that observed 
signals can be defined as a distribution of the wave energy density relative to 
the direction $(\theta, \phi)$. The WDF $F(\omega, \theta, \phi)$ is related to the spectral matrix 
$S_{ij}(\omega)$ by the following equation, 

$S_{ij}(\omega) = \frac{\pi}{2} \int_0^{2\pi} \int_0^{\pi} a_{ij}(\omega, \theta, \phi) F(\omega, \theta, \phi) \sin\theta \, d\theta \, d\phi, \quad (i, j = 1, \ldots, 6),$ (1) 



where $S_{ij}$ are the elements of the spectral matrix calculated from electric and 
magnetic wave fields at the observation point, and $a_{ij}(\omega, \theta, \phi)$ are the integration 
kernels theoretically determined by the plasma parameters. Using the known 
parameters $S_{ij}$ and $a_{ij}$, we can estimate the function $F$ by solving the set of 
integral equations (1), but it is an ill-posed problem whose solution may not 
be unique. In the present study, we propose a Gaussian distribution model in 
which we assume that the observed signal consists of $m$ clusters of waves whose 
distribution function $F$ is represented as $\alpha_i \exp(\cdots)$, where $\alpha_i$ is the 
intensity at the center of distribution $(\theta_i, \phi_i)$, $d_i$ represents the angular extent of 
the distribution, and $d_{i0}(\theta, \phi)$ is the angle between $(\theta, \phi)$ and $(\theta_i, \phi_i)$. Unknown 
parameters $(\theta_i, \phi_i, \alpha_i, d_i)$ are determined by the non-linear least-squares fitting 
method. 
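
To make the roles of the model parameters concrete, the sketch below evaluates a sum of Gaussian-like lobes on the sphere. The angular distance between two directions is the standard great-circle angle. The exponent form `exp(-(d/d_i)**2)` is only an assumption made for this illustration, because the exact expression is not legible in the text above; the function names are likewise hypothetical.

```python
import math


def angular_distance(theta, phi, theta_i, phi_i):
    """Angle in radians between directions (theta, phi) and (theta_i, phi_i),
    where theta is the polar angle and phi the azimuth."""
    cos_d = (math.cos(theta) * math.cos(theta_i)
             + math.sin(theta) * math.sin(theta_i) * math.cos(phi - phi_i))
    # Clamp against rounding errors before taking the arc cosine.
    return math.acos(max(-1.0, min(1.0, cos_d)))


def gaussian_wdf(theta, phi, sources):
    """Sum of lobes alpha_i * exp(-(d / d_i)**2) over all sources.

    ASSUMPTION: this exponent is illustrative; the paper's exact form is unknown.
    Each source is a tuple (theta_i, phi_i, alpha_i, d_i) with angles in radians."""
    total = 0.0
    for theta_i, phi_i, alpha_i, d_i in sources:
        d = angular_distance(theta, phi, theta_i, phi_i)
        total += alpha_i * math.exp(-(d / d_i) ** 2)
    return total


# Two sources with (30 deg, 60 deg, 3, 10 deg) and (60 deg, 200 deg, 1, 30 deg).
rad = math.radians
sources = [(rad(30), rad(60), 3.0, rad(10)), (rad(60), rad(200), 1.0, rad(30))]
value_at_first_center = gaussian_wdf(rad(30), rad(60), sources)
```

At the center of the first lobe the value is close to its intensity 3, since the second lobe is far away compared with its angular extent.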
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In order to get an adequate solution, the number of arrival waves m and 
the initial values for $(\theta_i, \phi_i, \alpha_i, d_i)$ must be determined. We assume that the 
observed spectral matrix $S_{\mathrm{obs}}$ is composed by a combination of the estimated 
spectral matrices $S_{l,\mathrm{est}}$, 

$S_{\mathrm{obs}} = \sum_{l=1}^{n} x_l S_{l,\mathrm{est}}$ (2) 

where $n$ is the total number of discrete points in $(\theta, \phi)$ space used for the initial 
value determination. In the condition that no variables $(x_1, \ldots, x_n)$ are negative, 
it is confirmed by the simplex method that there is no solution which satisfies 
(2). We define the following energy function $E$ as a dispersion of the ratios of 
corresponding elements of both sides in (2), 

$E = \cdots$ (3) 

Applying the steepest descent method combined with the random search method, 
we can solve the dynamics system of $E$ and obtain an optimum solution of 
$(x_1, \ldots, x_n)$ which realizes the minimum $E$. This solution is referred to as the 
approximate distribution of the WDF for the determination of the initial values 
of the parameter fitting. This pre-processing is useful for examining the validity 
of the final solution reconstructed by the Gaussian distribution model. 
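
The pre-processing step can be sketched as a small optimisation under several simplifying assumptions: the spectral matrices are flattened to lists of positive real numbers, the energy is taken to be the mean squared deviation from 1 of the element-wise ratios (the paper's exact form of $E$ is not legible above), and the random search is realised as random restarts of a projected steepest descent. All names, the step size, and the toy data are hypothetical.

```python
import random


def energy(x, s_est, s_obs):
    """ASSUMED energy: mean squared deviation from 1 of the ratios
    (sum_l x_l * S_l,est[k]) / S_obs[k] over all elements k."""
    total = 0.0
    for k, obs in enumerate(s_obs):
        model = sum(x_l * s_l[k] for x_l, s_l in zip(x, s_est))
        total += (model / obs - 1.0) ** 2
    return total / len(s_obs)


def gradient(x, s_est, s_obs):
    """Analytic gradient of the assumed energy with respect to x."""
    grad = [0.0] * len(x)
    for k, obs in enumerate(s_obs):
        model = sum(x_l * s_l[k] for x_l, s_l in zip(x, s_est))
        resid = model / obs - 1.0
        for l, s_l in enumerate(s_est):
            grad[l] += 2.0 * resid * s_l[k] / obs
    return [g / len(s_obs) for g in grad]


def descend(x, s_est, s_obs, step=1.0, iters=500):
    """Steepest descent; clipping at zero keeps every x_l non-negative."""
    for _ in range(iters):
        g = gradient(x, s_est, s_obs)
        x = [max(0.0, x_l - step * g_l) for x_l, g_l in zip(x, g)]
    return x


def minimize_energy(s_est, s_obs, starts=5, seed=0):
    """Random restarts (the random-search part) around steepest descent."""
    rng = random.Random(seed)
    best_x, best_e = None, float("inf")
    for _ in range(starts):
        x0 = [rng.random() for _ in s_est]
        x = descend(x0, s_est, s_obs)
        e = energy(x, s_est, s_obs)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e


# Toy data: the "observation" is exactly 2 * first + 1 * second basis matrix.
s_est = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
s_obs = [5.0, 6.0, 7.0]
best_x, best_e = minimize_energy(s_est, s_obs)
```

The step size of 1.0 happens to be stable for this toy problem; real data would need a tuned or adaptive step.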

3 Simulation 

The performance of the WDF method with the Gaussian distribution model is 
evaluated using the computer-generated spectral matrices $S_{ij}$ calculated by (1) 
from given wave distribution functions $F(\omega, \theta, \phi)$. In the simulation, the source 
wave is assumed to be a whistler mode wave at 10 kHz, and the plasma frequency 
and the cyclotron frequency of electrons are 60 kHz and 400 kHz, respectively. 
Several cases are examined by varying the distribution of arrival waves. 

The given and reconstructed wave distributions in the case where $F$ is 
composed of two Gaussian distributions with the parameters $(\theta_1, \phi_1, \alpha_1, d_1)$ = (30°, 
60°, 3, 10°) and $(\theta_2, \phi_2, \alpha_2, d_2)$ = (60°, 200°, 1, 30°) are shown in Fig. 1. It is 
found that the wave distributions are successfully reconstructed, and the fitting 
errors are small enough. In the case where $F$ is assumed to be distributed along 
the resonance angle of the whistler mode wave, the given and approximate wave 
distributions at the pre-processing stage are shown in Fig. 2. In this case, the wave 
distribution is practically reconstructed by the combination of several Gaussian 
distributions at the pre-processing stage, although the given distribution is 
non-Gaussian. 

4 Results 

The direction finding method using the wave distribution function with the 
Gaussian distribution model is proposed. It is found that the wave distribution is well 
reconstructed by a combination of the Gaussian distributions. It is remarkable 
that the proposed method is applicable to all considered cases where the arrival 
wave is a point source or a combination of extended sources. The calculation 
time is short enough for practical use. 

[Figure panels: Given distribution, Reconstructed distribution] 

Fig. 1. Simulation (a); two Gaussian distributed sources. 

[Figure panels: Given distribution, Reconstructed distribution] 

Fig. 2. Simulation (b); the distribution along the resonance angle of whistler wave. 

This method has been applied to data observed by the Akebono satellite, such as 
Omega signals, whistlers, and chorus emissions. The derived wave normal directions 
are in the acceptable range from theoretical viewpoints. 

The famous mutagenicity data set of 230 compounds [1] encoded by topological 
descriptors [2] was processed by the GUHA method. 

The basic ideas of the GUHA (General Unary Hypotheses Automaton) method were given 
in [3] as early as 1966. The aim of the GUHA method is to generate hypotheses on 
relations among properties of the objects which are in some sense interesting. 

A hypothesis is generally composed of two parts: antecedent (A) 
and succedent (S). Antecedent and succedent are tied together by a so-called 
generalized quantifier, which describes the relation between them. Given antecedent 
and succedent, frequencies of four possible combinations can be computed and 
expressed in compressed form as a so-called four-fold table (ff-table): 



ff-table            Succedent (S)    Non(succedent) 
Antecedent (A)            a                b 
Non(antecedent)           c                d 

Here a is the number of the objects satisfying antecedent and succedent 
(implication is valid), b is the number of the objects satisfying antecedent but not 
satisfying succedent (implication is not valid), etc. 

The basic generalized quantifier defined and used in GUHA is given by Fisher's 
exact test, known from mathematical statistics, and the relative frequency PROB = a/(a+b). 
In our previous paper [4] we used GUHA for Structure-Activity Relationships 
(SAR) with the same original data [1] as in this work, encoded by original fingerprint 
descriptors. Topological descriptors [5] for coding the same data have been used in 
this work. The mutagenicity data set (230 compounds) [1] was given in two tables. Both 
coding methods can describe compounds in the same manner, therefore there can be 
redundancy in the data. This redundancy is inconvenient in the search for 
Structure-Activity Relationships (SAR), but the method used (GUHA) enables the choice of the 
best of the redundant variables for a given dependency relation. 
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We presented a number of hypotheses [5] discovered by GUHA+/- [6]. The 
following two hypotheses are interesting from the point of view of toxicology: 

• High number of ortho-peri carbon atoms ⇒ high mutagenicity. 

• High Balaban index AND low total negative $p_i$ ⇒ low mutagenicity. 

($p_i$ is the number of valence electrons of atom i) [2] 

The first hypothesis is identical with the hypothesis in our previous paper [4], 
where the data was encoded by our fingerprint descriptors. 

The next step should be interpretation of these hypotheses and generation of more 
precise hypotheses including three or more variables in antecedent, in accordance 
with knowledge of the variables. 

Our assumption, that GUHA can be used in the search for interdependencies, 
seems to be right. We tried to draw dependency graphs of the best hypotheses and 
they confirmed the trends. 

The applicability of GUHA to coding by fingerprint descriptors [4] and topological 
indices [5] in SAR has been demonstrated. GUHA is able to process the whole 
original data set. 

According to the theory of global interpretation of multiple hypotheses testing (see 
[7] chapter 8) the global significance of our results was considered. From this point of 
view the results as a whole can be interpreted as sufficiently reliable knowledge on 
the universe from which the data form a random sample. 
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1 Introduction 

The problem of brand choice or brand switching has been discussed for a 
long time in the marketing research field [1][2][3][6]. Existing studies focus on 
revealing the probability of brand switching and the factors related to brand 
switching. However, the brand choice behavior of individual customers has been neglected 
in most of the existing literature. In this study, we consider the problem of finding an 
optimal distribution strategy for discount coupons that determines to which 
customers and at what price coupons should be distributed, using detailed customer 
information. 

Distributing discount coupons based on customer information in this way is 
possible with today's new technology. For example, Pharma, which is one of the 
biggest drugstore chains in Japan, developed an advanced POS terminal that has a 
function to issue a customized coupon [5]. Furthermore, in today's virtual stores 
on the Internet it is possible to set a different price for each customer. 

In view of this, we formulate the problem as two different optimization 
problems by paying special attention to the interpretability of rules. We then 
propose a heuristic method for solving them. Using huge sales data of the biggest 
drugstore chain in Japan, we have applied the algorithm to the case of detergents. 
Using the method we found some interesting rules which can be implemented in 
real business. 

It should be pointed out here that the interpretability of the rules obtained 
from the solution is extremely important because, even if the solution is good, 
the rules obtained will never be implemented in an actual business action plan if 
they are not interpretable. Rules generated by decision trees such as See5 [7] are 
usually difficult to interpret from a practical point of view. 
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2 Problem Formulation 

We formulate the problem of finding an optimal distribution strategy of discount 
coupons as the following two optimization problems using different criteria under 
the budget constraint. 

(1) Maximization of Customer Gain: The first problem is to maximize the 
increase of the number of customers (called customer gain) who will newly 
begin to buy the brand with the coupon under the budget constraint. Let 
$q$ denote the number of customers who will purchase the target brand by 
using a coupon at a certain discounting price¹, and let $q_0$ denote the number of 
customers who will purchase the target brand even if the target brand is 
sold without discounting. Then the problem is to select customers so as to 
maximize $q - q_0$ under the budget constraint. 

(2) Maximization of Effective Cost: In the second problem, we take into account the 
marginal utility of the discount coupon. Suppose that the number of 
customers who will buy a brand is $N_p$ at the discount price $p$. Among the $N_p$ 
customers there may be those who are willing to buy the brand even at a 
price $p'$ higher than $p$. Let $N_{p'}$ be the number of such customers. Thus, the 
cost $(p' - p) N_{p'}$ can be regarded as the wasted cost. Subtracting the wasted 
cost from the cost actually incurred gives the effective cost. The second 
problem is to maximize the effective cost under the budget constraint. 

Suppose we have $n$ attributes denoted by $A_1, A_2, \ldots, A_n$. For each attribute 
$A_i$, it is assumed that the domain of $A_i$ (denoted by dom($A_i$)) is appropriately 
discretized, i.e., dom($A_i$) = $\{a_i^1, a_i^2, \ldots, a_i^{s_i}\}$, where $s_i$ = |dom($A_i$)|. We assume 
in this formulation that the discount price, which is a decision variable, is a discrete 
variable. Associated with attribute $A_i$, let $R_i$ be a pair $(S_i, d_i)$ where $S_i \subseteq$ 
dom($A_i$) and $d_i$ is the discount price. The meaning of $R_i$ is that we distribute a 
discount coupon at the price $d_i$ to the customers whose attribute $A_i$ takes a 
value in $S_i$. Then the rule $R$ is a union of the $R_i$, expressed as $R = \bigcup_{i=1}^{n} R_i$. 

Namely, the customers selected by the rule are those for which there exists at 
least one attribute $A_i$ whose value for the customer lies in $S_i$. 
This rule is called a 1D rule since we construct rules by considering 
attributes separately. We can easily extend it to a 2D rule. Then, letting $U_1(R)$ 
denote the customer gain $q - q_0$ among the customers that satisfy the rule $R$, 
the first problem is formulated as 

P1 : maximize $U_1(R)$ subject to $C(R) \le B$, 

where $B$ is a given budget and $C(R)$ is the total cost incurred by implementing 
rule $R$. Letting $U_2(R)$ denote the effective cost for a given rule $R$, the second 
problem is formulated as 

P2 : maximize $U_2(R)$ subject to $C(R) \le B$. 

¹ We define a discounting price in two different ways: the relative discounting 
price in time sequence and the relative discounting price between two brands. 
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3 Greedy Algorithm 

Although NP-completeness is not yet proven, problem P1 seems difficult. 
Problem P2 is NP-complete because the simpler version where there exist only two 
attributes and there is no redemption cost is equivalent to the one discussed in 
[4] which was shown to be NP-complete. Hence we propose a heuristic algorithm. 
Let us first consider the problem P1 for the 1D rule. Let $d_0, d_1, \ldots, d_m$ be the set 
of possible discount values, where $d_0$ means no discount. For attribute $A_h$, and 
for $a_i \in$ dom($A_h$), let us focus on the customer group $G_i$ whose $A_h$ takes $a_i$. Let 
$g_i = |G_i|$. Let $p_{ij}$ for $0 \le j \le m$ denote the probability that a customer in $G_i$ 
purchases the brand when the discount is $d_j$. Then the expected customer gain 
is $(p_{ij} - p_{i0}) g_i$, and the expected total cost spent for group $G_i$ is $(d_j p_{ij} + c_d) g_i$, 
where the first term in the parentheses represents the expected redemption cost per 
customer and the second the distribution cost. Let $r_{ij}$ denote the ratio of the 
expected customer gain to the expected total cost, i.e., $r_{ij} = (p_{ij} - p_{i0}) / (d_j p_{ij} + c_d)$. 

If $r_{ij}$ is large, the discount by $d_j$ for customer group $G_i$ is effective. Thus, we 
compute $r_{ij^*} = \max_{1 \le j \le m} r_{ij}$. We compute such $r_{ij^*}$ for every customer group 
in every attribute. Our algorithm selects the customer group that attains the 
highest $r_{ij^*}$ and the rule $(G_i, d_{j^*})$ is adopted, that is, a coupon of discount $d_{j^*}$ 
will be sent to all customers of $G_i$. In order to eliminate the possibility that 
a customer receives two coupons, we ignore the customers of $G_i$ in the succeeding 
process. We repeat this process until the whole budget is consumed. For 
problem P2, the algorithm is essentially the same as the one for P1 except that 
the objective function is replaced by the effective cost. We now formally define 
the effective cost. Let us consider the customer group $G_i$ as before. Then the 
expected number of customers in $G_i$ who start to purchase the brand only after 
the discount becomes $d_k$ is $(p_{ik} - p_{i,k-1}) g_i$. Thus, the wasted cost for such 
customers at discount $d_j$ is $(d_j - d_k)(p_{ik} - p_{i,k-1}) g_i$. Summing this cost over all possible discount 
values, we can compute the total wasted cost as $\sum_{k=0}^{j} (d_j - d_k)(p_{ik} - p_{i,k-1}) g_i$. 
Then the effective cost is written as $d_j p_{ij} g_i - \sum_{k=0}^{j} (d_j - d_k)(p_{ik} - p_{i,k-1}) g_i$. We 
can similarly describe the algorithm for 2D rules, but we omit it here. 
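
The greedy procedure for P1 can be sketched as follows. The sketch makes two simplifying assumptions that go beyond the description above: the customer groups are treated as disjoint, so removing a selected group is enough to prevent double coupons, and the loop stops as soon as the best remaining rule no longer fits into the budget. The function name and the numbers are hypothetical.

```python
def greedy_p1(groups, discounts, dist_cost, budget):
    """Greedy 1D-rule selection for problem P1 (sketch).

    groups maps a group name to (g_i, [p_i0, p_i1, ..., p_im]);
    discounts is [d_0 = 0, d_1, ..., d_m]; dist_cost is c_d."""
    remaining = dict(groups)
    rules, spent = [], 0.0
    while remaining:
        best = None  # (ratio r_ij, group name, discount index j, total cost)
        for name, (size, p) in remaining.items():
            for j in range(1, len(discounts)):
                unit_cost = discounts[j] * p[j] + dist_cost
                ratio = (p[j] - p[0]) / unit_cost
                if best is None or ratio > best[0]:
                    best = (ratio, name, j, unit_cost * size)
        _, name, j, cost = best
        if spent + cost > budget:
            break  # assumed stopping rule: the best rule no longer fits
        rules.append((name, discounts[j]))
        spent += cost
        del remaining[name]  # these customers receive no further coupon
    return rules, spent


# Hypothetical groups: size and purchase probabilities at discounts 0, 50, 100 yen.
groups = {"A": (100, [0.2, 0.5, 0.9]), "B": (200, [0.1, 0.15, 0.3])}
small_rules, small_spent = greedy_p1(groups, [0, 50, 100], 10, 10000)
large_rules, large_spent = greedy_p1(groups, [0, 50, 100], 10, 12000)
```

For P2 the same loop could be reused with the ratio replaced by a score based on the effective cost.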



4 Experimental Results and Observations 

Using the greedy algorithm described above, we analyze the real purchase data from 
the drugstore chain, which contains the purchasing history of 30,358 customers over two 
years from 1996 to 1998. We focus on the product category of laundry 
detergent, and we select the customers who purchased laundry detergent during the 
previous two years. There are three major brands in Japan, referred to as "Brand_1", 
"Brand_2" and "Brand_3". All results below are from Brand_1's point of 
view. The data set consists of three categories of attributes: 1) the 
target attribute, which takes 1 if a customer bought the target brand (Brand_1) 
and 0 otherwise, 2) the discount price on that purchase, and 3) predictive attributes 
such as age, brand loyalty, and the number of visits, which are based on the 
customer's basic information or the purchase history. We have applied the greedy 
algorithm to the data set. The results are summarized as follows. 

a) Derived Rules: Table 1 shows the top three rules obtained by solving P1 
for a distribution cost of 100 yen. The first rule, for example, reads that if we 
issue coupons whose discount price is lower than usual by 100 yen to the 
customers whose purchased share of Brand_2 is in a range of 50% to 58% 
and purchased share of a daily necessity is in a range of 29.9% to 41.4%, 
then the purchasing probability increases from 17.8% to 87.9%. These rules are 
very meaningful and interpretable because we can understand what type of 
customers change to the target brand at what discount price. 

b) 1D rule vs. 2D rule vs. Random sampling: Figure 1 shows how customer gain 
increases as the budget increases for the solutions of P1 obtained by the 
one-dimensional (1D) rule, the two-dimensional (2D) rule, and random sampling. As 
we expected, 2D rules are superior to 1D rules, which are superior to random 
sampling. 

c) Training accuracy: To measure the accuracy of the rules we build, we 
separate the original data into training and test sets. We then apply the rules 
generated from the training data to the test data. Figure 2 shows 
that the accuracy (difference between training and test) of 1D rules is better 
than that of 2D rules because overfitting is high for 2D. 

5 Conclusion 

We have formulated the problem of determining to which customers and at what 
price the manufacturer should distribute discount coupons as two optimization 
problems, and we developed a heuristic method which can be easily 
implemented in the real business field. The experimental result shows that our method 
produces rules which are meaningful and interpretable for the real business world. 

Table 1. Examples of the 2D rules obtained by solving P1 for distribution cost=100 
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1 Introduction 

An association rule [1] is an expression of the form $X \rightarrow Y$ where $X$ and $Y$ 
are sets of items. The intuitive meaning is that transactions (e.g. supermarket 
baskets) containing set $X$ of items tend to contain set $Y$ of items. Two measures 
of intensity of an association rule are used, confidence $C$ and support $S$. 

A 4ft association rule is an expression of the form $\varphi \approx \psi$ where $\varphi$ and $\psi$ are 
derived Boolean attributes. The intuitive meaning is that $\varphi$ and $\psi$ are in the relation 
given by the symbol $\approx$. This symbol is called a 4ft quantifier. 

Our first goal is to introduce 4ft association rules and to argue for their 
usefulness, see section 2. There are various classes of 4ft association rules, e.g. 
equivalency rules, double implication rules or implication rules. There are also 4ft 
association rules corresponding to statistical hypothesis tests. We 
show that the association rule is a special case of 4ft association rules. We will also 
define conditional 4ft association rules, see section 3. 

The second goal is to introduce the procedure 4ft-Miner, which mines both for 4ft 
association rules and for conditional 4ft association rules. This procedure is introduced 
in section 4. 



2 4ft Association Rules 



A 4ft association rule is an expression $\varphi \approx \psi$ where $\varphi$ and $\psi$ are derived Boolean 
attributes. Boolean attributes $\varphi$ and $\psi$ correspond to {0,1}-columns of the analysed 
data matrix $M$ (a database relation). They are derived from original columns 
$A_1, \ldots, A_K$ of $M$, see Tab. 1. 

We suppose that $a_{1,1}$ is the value of attribute $A_1$ for object $o_1$, $a_{n,K}$ is the 
value of attribute $A_K$ for object $o_n$, etc. The relation $\varphi \approx \psi$ is evaluated on the basis 
of the 4ft table, see Tab. 2. 



object   $A_1$       $A_2$       ...   $A_K$       $\varphi$   $\psi$ 
$o_1$    $a_{1,1}$   $a_{1,2}$   ...   $a_{1,K}$   1           0 
...      ...         ...         ...   ...         ...         ... 
$o_n$    $a_{n,1}$   $a_{n,2}$   ...   $a_{n,K}$   0           1 

Tab. 1 - Data matrix $M$ 

$M$              $\psi$   $\neg\psi$ 
$\varphi$        a        b 
$\neg\varphi$    c        d 

Tab. 2 - 4ft table of $M$, $\varphi$ and $\psi$ 

Here a is the number of objects (rows of $M$) satisfying both $\varphi$ and $\psi$, b is 
the number of objects satisfying $\varphi$ and not satisfying $\psi$, etc. The quadruple 
$\langle a, b, c, d \rangle$ is the 4ft-table of $\varphi$ and $\psi$ in $M$; we abbreviate it by $4ft(M, \varphi, \psi)$. 

Various kinds of dependencies of $\varphi$ and $\psi$ can be expressed by suitable 
conditions concerning the 4ft table. Each condition corresponds to a 4ft-quantifier. We 
say that $\varphi \approx \psi$ is true in data matrix $M$ (symbolically $Val(\varphi \approx \psi, M) = true$) 
if the corresponding condition concerning the 4ft-table of $\varphi$ and $\psi$ in $M$ is satisfied. 
Examples (see [2], [4]): 

- 4ft-quantifier $\equiv_{p,s}$ of p-equivalence for $0 < p \le 1$ and $s > 0$ is defined by 
the condition $\frac{a+d}{a+b+c+d} \ge p \wedge a \ge s$. It means that $\varphi$ and $\psi$ have the same 
value (either true or false) for at least 100p per cent of all objects of $M$ and 
that there are at least s objects of $M$ satisfying both $\varphi$ and $\psi$. 

- 4ft-quantifier $\Leftrightarrow_{p,s}$ of founded double implication for $0 < p \le 1$ and $s > 0$ is 
defined by the condition $\frac{a}{a+b+c} \ge p \wedge a \ge s$. It means that at least 100p per 
cent of objects of $M$ satisfying $\varphi$ or $\psi$ satisfy both $\varphi$ and $\psi$ and that there 
are at least s objects of $M$ satisfying both $\varphi$ and $\psi$. 

- 4ft-quantifier $\Rightarrow_{p,s}$ of founded implication for $0 < p \le 1$ and $s > 0$ is defined 
by the condition $\frac{a}{a+b} \ge p \wedge a \ge s$. It means that at least 100p per cent of 
objects of $M$ satisfying $\varphi$ satisfy also $\psi$ and that there are at least s objects 
of $M$ satisfying both $\varphi$ and $\psi$. 

- 4ft-quantifier $\Rightarrow^{!}_{p,\alpha,s}$ of lower critical implication for $0 < p \le 1$, $0 < \alpha < 0.5$ 
and $s > 0$ is defined by the condition $\sum_{i=a}^{a+b} \binom{a+b}{i} p^i (1-p)^{a+b-i} \le \alpha \wedge a \ge s$. 
It corresponds to a statistical test (on the level $\alpha$) of the null hypothesis 
$H_0: P(\psi|\varphi) \le p$ against the alternative one $H_1: P(\psi|\varphi) > p$. Here $P(\psi|\varphi)$ 
is the conditional probability of the validity of $\psi$ under the condition $\varphi$. 
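
The four conditions can be checked directly on a 4ft table. The sketch below is an illustration under the conditions as reconstructed from the verbal descriptions above; the function names and the example tables are hypothetical. Testing a >= s first guarantees a non-zero denominator because s > 0.

```python
from math import comb


def p_equivalence(a, b, c, d, p, s):
    """(a + d) / (a + b + c + d) >= p and a >= s."""
    return a >= s and (a + d) / (a + b + c + d) >= p


def founded_double_implication(a, b, c, d, p, s):
    """a / (a + b + c) >= p and a >= s."""
    return a >= s and a / (a + b + c) >= p


def founded_implication(a, b, c, d, p, s):
    """a / (a + b) >= p and a >= s."""
    return a >= s and a / (a + b) >= p


def lower_critical_implication(a, b, c, d, p, alpha, s):
    """Binomial tail sum_{i=a}^{a+b} C(a+b, i) p^i (1-p)^(a+b-i) <= alpha and a >= s."""
    r = a + b
    tail = sum(comb(r, i) * p ** i * (1 - p) ** (r - i) for i in range(a, r + 1))
    return a >= s and tail <= alpha


# Two hypothetical 4ft tables (a, b, c, d).
table_1 = (950, 20, 30, 9000)
table_2 = (950, 20, 500, 100)
results = {
    "equivalence": (p_equivalence(*table_1, 0.9, 100),
                    p_equivalence(*table_2, 0.9, 100)),
    "double implication": (founded_double_implication(*table_1, 0.9, 100),
                           founded_double_implication(*table_2, 0.9, 100)),
    "implication": (founded_implication(*table_1, 0.9, 100),
                    founded_implication(*table_2, 0.9, 100)),
}
```

The second table has many objects satisfying only $\psi$ (large c), so the implication still holds while the double implication and the equivalence fail.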

The "classical" association rule $X \rightarrow Y$ can also be understood as a 4ft association 
rule. Let us suppose that $A_1, \ldots, A_K$ are Boolean attributes corresponding to 
particular items, $a_{1,1} = 1$ means that supermarket basket $o_1$ contains item $A_1$, 
etc., see Tab. 1. Then, the association rule $\{A_1, A_2\} \rightarrow \{A_3, A_4\}$ with confidence $C$ 
and support $S$ can be understood as the 4ft association rule $A_1 \wedge A_2 \rightarrow_{C,S} A_3 \wedge A_4$. 
The condition $\frac{a}{a+b} \ge C \wedge \frac{a}{a+b+c+d} \ge S$ concerning the 4ft table corresponds to the 4ft 
quantifier $\rightarrow_{C,S}$ of the "classical" association rule. 

Let us emphasise differences among 4ft quantifiers $\equiv_{p,s}$, $\Leftrightarrow_{p,s}$, $\Rightarrow_{p,s}$ and
$\rightarrow_{C,S}$. We will use examples with p = 0.9, s = 100, C = 0.9 and S = 0.1, Boolean
attributes $\varphi$, $\psi$ and three data matrices $M_1$, $M_2$, $M_3$ with different 4ft tables,
see Tab. 3, Tab. 4 and Tab. 5. It is easy to verify that values $Val(\varphi \approx \psi, M)$ of
particular 4ft association rules are according to Tab. 6. We suppose $M = M_1$,
$M = M_2$ or $M = M_3$.
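
To make the comparison concrete, the sketch below evaluates the four quantifiers on three 4ft tables. The tables are hypothetical values chosen by us so that they reproduce the pattern of truth values reported in Tab. 6; they are not the contents of Tab. 3, Tab. 4 and Tab. 5.

```python
def evaluate(table, p=0.9, s=100, C=0.9, S=0.1):
    """Return the truth value of the four quantifiers on a 4ft table (a, b, c, d)."""
    a, b, c, d = table
    n = a + b + c + d
    return {
        "equivalence": (a + d) / n >= p and a >= s,
        "double implication": a / (a + b + c) >= p and a >= s,
        "founded implication": a / (a + b) >= p and a >= s,
        "classical": a / (a + b) >= C and a / n >= S,
    }

# Hypothetical 4ft tables (assumptions for illustration only).
M1 = (950, 25, 25, 0)     # all four conditions hold
M2 = (500, 10, 490, 0)    # large c breaks equivalence and double implication
M3 = (100, 5, 95, 9800)   # large d gives equivalence, but support a/n is only 0.01

for name, table in (("M1", M1), ("M2", M2), ("M3", M3)):
    print(name, evaluate(table))
```

The second table shows how many objects satisfying only $\psi$ leave the implication quantifiers untouched, and the third shows how a large d separates the support-based "classical" rule from the founded implication.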
Tab. 3 - $4ft(M_1,\varphi,\psi)$    Tab. 4 - $4ft(M_2,\varphi,\psi)$    Tab. 5 - $4ft(M_3,\varphi,\psi)$







Jan Rauch and Milan Simunek 



data matrix M                                          M1      M2      M3
$Val(\varphi \equiv_{0.9,100} \psi, M)$                true    false   true
$Val(\varphi \Leftrightarrow_{0.9,100} \psi, M)$       true    false   false
$Val(\varphi \Rightarrow_{0.9,100} \psi, M)$           true    true    true
$Val(\varphi \rightarrow_{0.9,0.1} \psi, M)$           true    true    false

Tab. 6 - Values $Val(\varphi \approx \psi, M)$ of particular 4ft association rules



Let us remark that logical calculi, formulae of which correspond to 4ft association
rules, were defined and studied. Various useful theoretical results were achieved
concerning, among others, deduction rules, see [2], [4]. Several classes of 4ft
association rules were also defined and studied. 4ft-quantifier $\equiv_{p,s}$ is an example
of equivalence quantifiers, $\Leftrightarrow_{p,s}$ is an example of double implication quantifiers,
$\Rightarrow_{p,s}$ and $\Rightarrow^{!}_{p,\alpha,s}$ are examples of implication quantifiers.



3 Conditional 4ft Association Rules 



A 4ft conditional association rule is an expression of the form $\varphi \approx \psi/\chi$ where
$\varphi$, $\psi$ and $\chi$ are derived Boolean attributes. The intuitive meaning is that $\varphi$ and
$\psi$ are in relation given by 4ft quantifier $\approx$ when the condition $\chi$ is satisfied.

If M is a data matrix and $\chi$ is a Boolean attribute derived from columns
of M then we mean by $M/\chi$ a data matrix containing exactly all rows of M
satisfying $\chi$. We say that $\varphi \approx \psi/\chi$ is true in data matrix M (symbolically
$Val(\varphi \approx \psi/\chi, M) = true$) if both there is a row of M satisfying $\chi$ and
$Val(\varphi \approx \psi, M/\chi) = true$.

There is no general relation between $Val(\varphi \approx \psi, M)$ and $Val(\varphi \approx \psi/\chi, M)$.
Let values of $\varphi$, $\psi$ and $\chi$ at data matrix $M_4$ be according to Tab. 7. Then
4ft tables $4ft(M_4, \varphi, \psi)$ and $4ft(M_4/\chi, \varphi, \psi)$ are in Tab. 8 and Tab. 9, thus
$Val(\varphi \Leftrightarrow_{0.9,100} \psi, M_4) = 0$ and $Val(\varphi \Leftrightarrow_{0.9,100} \psi/\chi, M_4) = 1$. It is easy to
modify $M_4$ such that $Val(\varphi \Leftrightarrow_{0.9,100} \psi, M_4) = 1$ and $Val(\varphi \Leftrightarrow_{0.9,100} \psi/\chi, M_4) = 0$.
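
The definition of a conditional rule amounts to restricting the data matrix to the rows satisfying the condition and then evaluating the ordinary rule. A minimal sketch follows; the row layout (dicts with 0/1 entries), the function names and the tiny data set with s scaled down to 10 are all assumptions made for illustration.

```python
def four_ft(rows, phi, psi):
    """Count the 4ft table (a, b, c, d) of predicates phi and psi over rows."""
    a = b = c = d = 0
    for row in rows:
        if phi(row) and psi(row):
            a += 1
        elif phi(row):
            b += 1
        elif psi(row):
            c += 1
        else:
            d += 1
    return a, b, c, d

def double_implication(table, p, s):
    a, b, c, _ = table
    return (a + b + c) > 0 and a / (a + b + c) >= p and a >= s

def val_conditional(rows, phi, psi, chi, p, s):
    """True iff some row satisfies chi and the rule holds on the rows satisfying chi."""
    restricted = [row for row in rows if chi(row)]
    if not restricted:
        return False
    return double_implication(four_ft(restricted, phi, psi), p, s)

# Hypothetical data: the rule fails on the whole matrix but holds under the condition.
rows = [{"phi": 1, "psi": 1, "chi": 1}] * 10 + [{"phi": 1, "psi": 0, "chi": 0}] * 10
phi = lambda r: r["phi"] == 1
psi = lambda r: r["psi"] == 1
chi = lambda r: r["chi"] == 1

unconditional = double_implication(four_ft(rows, phi, psi), 0.9, 10)
conditional = val_conditional(rows, phi, psi, chi, 0.9, 10)
```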
Tab. 7 - Data matrix $M_4$



Tab. 8 - $4ft(M_4,\varphi,\psi)$    Tab. 9 - $4ft(M_4/\chi,\varphi,\psi)$



4 Procedure 4ft-Miner 

Procedure 4ft-Miner mines for 4ft association rules $\varphi \approx \psi$ and for conditional
4ft association rules $\varphi \approx \psi/\chi$. 4ft quantifiers of fourteen types can be used, see
4.1. Derived Boolean attributes $\varphi$, $\psi$ and $\chi$ are conjunctions of basic Boolean
attributes; they are called antecedent, succedent and condition respectively, see
paragraph 4.2. Input and output of 4ft-Miner are described in paragraphs 4.3
and 4.4. Some further features of 4ft-Miner are mentioned in paragraph 4.5.
4.1 4ft-quantifiers of 4ft-Miner

Fourteen types of 4ft quantifiers are implemented in procedure 4ft-Miner. All
of them are defined by a condition concerning the 4ft table. There are three
implication quantifiers: $\Rightarrow_{p,s}$, $\Rightarrow^{!}_{p,\alpha,s}$ (see section 2) and the 4ft quantifier
$\Rightarrow^{?}_{p,\alpha,s}$ of upper critical implication defined by the condition
$\sum_{i=0}^{a} \binom{a+b}{i} p^i (1-p)^{a+b-i} > \alpha \wedge a \ge s$.

Further there are three analogous double implication quantifiers: $\Leftrightarrow_{p,s}$ (see
section 2), $\Leftrightarrow^{!}_{p,\alpha,s}$, $\Leftrightarrow^{?}_{p,\alpha,s}$, and three analogous equivalence quantifiers:
$\equiv_{p,s}$ (see section 2), $\equiv^{!}_{p,\alpha,s}$, $\equiv^{?}_{p,\alpha,s}$, see also [2], [4].

There are also quantifiers corresponding to the $\chi^2$ test and to Fisher's test and the
4ft quantifier defined by the condition $ad > e^{\delta} bc$ where $\delta \ge 0$. Quantifier $\rightarrow_{C,S}$
corresponding to "classical" association rule (see section 2) is also implemented.
The last implemented type of 4ft quantifiers is the 4ft quantifier corresponding to
the condition $\max(\ldots, \ldots) < \gamma$ where $0 < \gamma < 1$, see [6].

4.2 Antecedent, Succedent and Condition 

Antecedent, succedent and condition are conjunctions of basic Boolean attributes. 
A basic Boolean attribute is of the form $A[\sigma]$ where A is the column and $\sigma$ is a
subset of the set of possible values of A. Boolean attribute $A[\sigma]$ is true in the row
o of the analysed data matrix if the value in the column A and in the row o belongs
to the set $\sigma$. Thus, if $A_1[3,5]$ is true in row $o_i$, then $a_{i,1} = 3$ or $a_{i,1} = 5$. An
example of the conditional 4ft association rule:

$A_1[1,3,4] \wedge A_3[5,6] \Rightarrow_{0.9} A_5[8,12] \wedge A_7[11,12,14] \wedge A_8[2] \,/\, A_9[4,5,6,7,8]$.
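
Basic Boolean attributes and their conjunctions can be modelled as predicates on a row. The following is a minimal sketch under our own assumptions: rows are dicts from column name to value, and `basic_attribute` and `conjunction` are illustrative names, not 4ft-Miner functions.

```python
def basic_attribute(column, values):
    """A[sigma]: true in a row iff the row's value in `column` belongs to `values`."""
    allowed = frozenset(values)
    return lambda row: row[column] in allowed

def conjunction(*attributes):
    """Conjunction of basic Boolean attributes (antecedent, succedent or condition)."""
    return lambda row: all(attribute(row) for attribute in attributes)

# Antecedent of the example rule: A1[1,3,4] and A3[5,6].
antecedent = conjunction(
    basic_attribute("A1", [1, 3, 4]),
    basic_attribute("A3", [5, 6]),
)

print(antecedent({"A1": 3, "A3": 6}))   # row satisfies both basic attributes
print(antecedent({"A1": 2, "A3": 6}))   # value 2 is not in {1, 3, 4}
```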

4.3 Input of 4ft-Miner 

An input of 4ft-Miner is given by: (i): The analysed data matrix. (ii): 4ft
quantifier. (iii): Simple definition of all antecedents to be automatically generated. It
consists of: (iii-a): A list of all columns of data matrix, from which basic Boolean
attributes of antecedent will be automatically generated, (iii-b): Simple
definition of the set of all basic Boolean attributes to be automatically generated
from each particular column, (iii-c): Minimal and maximal number of basic
Boolean attributes in each generated antecedent. (iv): Analogous definitions of
all succedents and of all conditions to be automatically generated.

The set of all basic Boolean attributes to be generated from a particular 
column is given by a type of subsets and by the minimal and the maximal number 
of particular values in the subset. There are five types of subsets to be generated: 
all subsets, intervals, left cuts, right cuts and cuts. 

Examples of a definition of the set of basic Boolean attributes for column
A with possible values {1, 2, 3, 4, 5}: (1) all subsets with 2-3 values defines basic
Boolean attributes A[1,2], A[1,3], ..., A[4,5], A[1,2,3], A[1,2,4], ..., A[3,4,5];
(2) intervals with 2-3 values defines basic Boolean attributes A[1,2], A[2,3],
A[3,4], A[4,5], A[1,2,3], A[2,3,4] and A[3,4,5]; (3) left cuts with at most 3 values
defines basic Boolean attributes A[1], A[1,2], A[1,2,3].
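
The three worked examples can be reproduced with a few generator functions. This is a sketch, not the 4ft-Miner implementation: the function names are ours, and `right_cuts` is defined by analogy with left cuts, which is an assumption since the text gives no example for it.

```python
from itertools import combinations

def all_subsets(values, lo, hi):
    """All subsets of `values` with between lo and hi elements."""
    return [list(c) for k in range(lo, hi + 1) for c in combinations(values, k)]

def intervals(values, lo, hi):
    """All runs of lo..hi consecutive values (values are assumed ordered)."""
    return [values[i:i + k]
            for k in range(lo, hi + 1)
            for i in range(len(values) - k + 1)]

def left_cuts(values, hi):
    """Prefixes of the ordered values with at most hi elements."""
    return [values[:k] for k in range(1, min(hi, len(values)) + 1)]

def right_cuts(values, hi):
    """Suffixes of the ordered values with at most hi elements (assumed by analogy)."""
    return [values[-k:] for k in range(1, min(hi, len(values)) + 1)]

values = [1, 2, 3, 4, 5]
print(len(all_subsets(values, 2, 3)))   # number of subsets with 2 or 3 values
print(intervals(values, 2, 3))
print(left_cuts(values, 3))
```

The counts make the cost of each subset type visible: for five values, "all subsets with 2-3 values" yields 20 basic Boolean attributes, while "intervals with 2-3 values" yields only 7.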
4.4 Output of 4ft-Miner

4ft-Miner automatically generates all 4ft association rules or all conditional 4ft 
association rules given by the conditions (ii) - (v) (usually 10^ - 10^) and ver- 
ifies them in data matrix given by (i). Output of 4ft-Miner is the set of all 4ft 
association rules (all conditional 4ft association rules) true in data matrix given
by (i).

Usual output of 4ft-Miner consists of tens or hundreds of true 4ft association 
rules (true conditional 4ft association rules). There are strong tools for dealing 
with output of 4ft-Miner. It is possible to sort output 4ft association rules by
various criteria. Flexible conditions can be used to define subsets of output 4ft
association rules. It is also possible to export defined subsets in several formats. 



4.5 Some Further Features of 4ft-Miner 

4ft-Miner works under Windows; the analysed data matrix can be stored in a
database (ODBC is applied). New values can be defined for particular columns
(e.g. intervals or groups of original values). New columns of data matrix can be
also defined (SQL-like) and used in conditions (iii) - (v).

4ft-Miner works very fast. Usual task (data matrix with 10^ rows, several 
millions of 4ft association rules to be generated and verified) requires only several
minutes on a PC with a Pentium II and 128 MB of main memory. Several
optimisation techniques and deep theoretical results are used, e.g. bit strings for
representation of analysed data matrix [3] and deduction rules [4].
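
The text only names bit strings as an optimisation; the details are in [3]. The sketch below shows one common way such a representation can speed up 4ft-table computation, using Python integers as bit strings. It is our illustration under that assumption, not the actual 4ft-Miner code.

```python
def column_bits(column):
    """Pack a column of 0/1 values into an integer; bit i is set iff row i is true."""
    bits = 0
    for i, value in enumerate(column):
        if value:
            bits |= 1 << i
    return bits

def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def four_ft_bits(phi_bits, psi_bits, n_rows):
    """4ft table (a, b, c, d) from the bit strings of phi and psi."""
    mask = (1 << n_rows) - 1                    # keeps only the n_rows valid bits
    a = popcount(phi_bits & psi_bits)
    b = popcount(phi_bits & ~psi_bits & mask)
    c = popcount(~phi_bits & psi_bits & mask)
    d = n_rows - a - b - c
    return a, b, c, d

phi_column = [1, 1, 0, 0, 1]
psi_column = [1, 0, 1, 0, 1]
table = four_ft_bits(column_bits(phi_column), column_bits(psi_column), 5)
```

With this layout each candidate rule costs a handful of bitwise operations and bit counts instead of a pass over all rows.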

Let us remark that 4ft-Miner can deal with missing information [2], [5]. Let
us also emphasise that 4ft-Miner is a GUHA procedure in the sense of [2].
1 Introduction

Various techniques have been proposed for rule discovery using classification 
learning. In general, the learning speed of a system using genetic programming 
(GP) [1] is slow. However, a learning system which can acquire higher-order 
knowledge by adjusting to the environment can be constructed, because the 
structure is treated at the same time. 

On the other hand, there is the Apriori algorithm [2], a rule generating
technique for large databases. This is an association rule algorithm. The Apriori
algorithm uses two values for rule construction: a support value and a
confidence value. Depending on the threshold set for each of them, the search space
can be reduced, or the number of candidate association rules can be increased.
However, experience is necessary for setting an effective threshold.
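
As a reminder of what the two thresholds measure, the following sketch computes support and confidence of a single rule over a list of transactions. It illustrates the two values only; it is not the Apriori candidate-generation procedure itself, and the function name and toy data are ours.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support = share of transactions containing both sides of the rule;
    confidence = share of antecedent-containing transactions that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
support, confidence = support_and_confidence(transactions, {"a"}, {"b"})
```

Raising the minimum support prunes the search space early, whereas lowering either threshold admits more candidate rules, which is the trade-off the paragraph above describes.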

Both techniques have advantages and disadvantages as above. In this paper, 
we propose a rule discovery technique for databases using genetic programming 
combined with the Apriori algorithm. By using the combined rule generation 
learning method, it is expected to construct a system which can search for flexible 
rules in large databases. 

2 Proposed Rule Discovery Technique 

We propose a rule discovery technique which combines GP with the Apriori 
algorithm. By combining each technique, it is expected to increase the efficiency 
of the search for flexible rules in large databases. 

The following steps are proposed for the rule discovery technique. 

1. First, the Apriori algorithm generates the association rules.

2. Next, the generated association rules are converted into decision trees which 
are taken in as initial individuals of GP. The decision trees are trained by 
GP learning. 

3. The final decision tree is converted into classification rules. 

This allows effective schemata to be contained in the initial individuals of GP.
As a result, the GP's learning speed and its classification accuracy are expected
to improve.
To convert an association rule into a decision tree, we use the following
procedure.

1. For the first process, the route of the decision tree is constructed, assuming 
the conditions of the association rule as the attribute-based tests of the 
decision tree. 

2. In the next process, the conclusion of the association rule is appended to
the terminal node of this route.

3. Finally, the terminal nodes which are not defined by the association rule are 
assigned candidate nodes at random. 
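
The three conversion steps above can be sketched as follows. The tree encoding (nested dicts), the function names and the attribute names are assumptions made for illustration, and "candidate nodes at random" is interpreted here as assigning a random class label to each off-route leaf; the paper does not spell out these details.

```python
import random

def rule_to_tree(conditions, conclusion, values, classes, rng=random):
    """Build a decision tree whose main route tests the rule's conditions in order.

    conditions: list of (attribute, required_value) pairs from the rule body.
    conclusion: class label appended at the end of the route (step 2).
    Leaves not determined by the rule get a random class (step 3).
    """
    if not conditions:
        return conclusion
    (attribute, required), rest = conditions[0], conditions[1:]
    branches = {}
    for value in values:
        if value == required:
            branches[value] = rule_to_tree(rest, conclusion, values, classes, rng)
        else:
            branches[value] = rng.choice(classes)
    return {"attribute": attribute, "branches": branches}

def classify(tree, instance):
    """Follow the branches matching the instance until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

# Hypothetical rule: x = y and z = n -> republican.
classes = ["democrat", "republican"]
tree = rule_to_tree([("x", "y"), ("z", "n")], "republican", ["y", "n"], classes,
                    random.Random(0))
```

An instance that satisfies every condition of the rule is classified by the rule's conclusion; all other instances fall onto the randomly labelled leaves, which GP is then expected to refine.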

To convert the GP's decision tree into classification rules, we use
the process proposed by Quinlan [3].

3 Experiments 

To verify the validity of the proposed method, we applied it to the house-votes
data from the UCI Machine Learning Repository [4] and to a medical database on the
occurrence of hypertension [5]. From here on, every use of GP, including in the
proposed method, refers to Automatically Defined Function Genetic Programming
(ADF-GP) [1]. In the proposed method, we took the association rules generated
by the Apriori algorithm as initial individuals of GP. We compared the results
of the proposed method against GP. We use the house-votes database as a small test
database expressed by discrete values, and the hypertension database as a large test
database expressed by continuous values.



3.1 Application to house-votes Data 

For evaluation, we used the house-votes data from the UCI Machine Learning
Repository [4]. We compared the results of the proposed method with GP. The
evaluation data contains 16 attributes and 2 classes. The attributes are, for example,
“handicapped-infants” and “water-project-cost-sharing”. They are expressed
by 3 values: “y”, “n”, and “?”. The 2 classes are “democrat” and
“republican”. 50 cases out of the total 435 house-votes records were used as training
data.

We extracted association rules from the database with the Apriori algorithm.
We applied the Apriori algorithm to a data set excluding data with the “?”
value, because the “?” value means “others”. In the following experiment, we used
minimum support value (= 30) and minimum confidence value (= 90). As a
result of the experiment, 75 rules were generated.

Next, the 75 generated rules were used as initial individuals. The
result of the evaluation of each fitness value is shown in Fig. 1, and the result
of the best individual is shown in Table 1.

By using GP, inference accuracy did not improve rapidly. However, the
proposed method showed fast learning and achieved high accuracy. Comparing the
best individual results, the proposed method showed better results than GP,
except for accuracy on the training data. Concerning the results on the training
data, GP may have overfitted, but this result alone cannot confirm it.

Fig. 1. Evaluation of Each Fitness Value

Table 1. Experiment Best Individuals Result (House-votes).

                    training (%)   all data (%)   nodes   depths   generations
ADF-GP              100.0          86.0           11      3        586
Apriori + ADF-GP    98.0           92.9           9       2        235

The constructed decision tree was converted into rules, and invalid and
meaningless rules were removed. The rules' total accuracy was 94.8%.



3.2 Application to Medical Database 

We applied the method to a medical diagnostic database for the occurrence of
hypertension. We compared the results of the proposed method with GP. Most of the
data values are expressed as continuous values, and the size of the database is
larger than the house-votes database.

The occurrence of hypertension database contains 15 input terms and 1
output term. There are 2 kinds of intermediate assumptions between the input
terms and the output term [6]. Among the input terms, 10 terms are categorized
into a biochemical test related to the measurement of blood pressure for the past
five years, and the other terms are “Sex”, “Age”, “Obesity Index”, “γ-GTP”,
and “Volume of Alcohol Consumption”. The 1 output term represents whether the
patient has an attack of hypertension for the input record. The database has
1024 patient records. In this paper, we selected 100 occurrence data and 100
no-occurrence data by random sampling, and these were used as the training data.

Association rules were extracted from the database by the Apriori
algorithm. Because continuous-valued attributes are included in this database,
the Apriori algorithm was applied after these attributes had been converted
into binary attributes using the average of each attribute. To examine how the
minimum support value and the minimum confidence value relate to the number
of rules, we experimented with several threshold patterns (see Table 2). In the
following experiment, we used minimum support value (= 30) and minimum
confidence value (= 90).
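
The binarization step can be sketched as below. The paper only says that the average was used; treating values strictly above the per-attribute mean as 1 and all others as 0 is our assumption, as are the function name and the sample records.

```python
def binarize_by_mean(records):
    """Convert numeric attributes to 0/1 using each attribute's mean as the threshold.

    records: list of dicts sharing the same numeric keys.
    Returns the binarized records and the means that were used.
    """
    keys = list(records[0].keys())
    means = {k: sum(r[k] for r in records) / len(records) for k in keys}
    binary = [{k: int(r[k] > means[k]) for k in keys} for r in records]
    return binary, means

# Hypothetical patient records (values are illustrative only).
records = [
    {"age": 30, "obesity_index": 20.0},
    {"age": 50, "obesity_index": 30.0},
    {"age": 70, "obesity_index": 22.0},
]
binary, means = binarize_by_mean(records)
```

After this step each record is a set of Boolean items, which is the input form the Apriori algorithm expects.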



Table 2. Relations between Thresholds and Number of Rules

Minimum Support Value   Minimum Confidence Value   Rules
25                      75                         396
30                      75                         125
25                      90                         187
30                      90                         33



Next, the 33 rules generated by the Apriori algorithm were used as
initial individuals. The result of the best individual is shown in Table 3.

By using GP, inference accuracy did not improve rapidly. However, the pro- 
posed method showed fast learning and achieved high accuracy. 



Table 3. Experiment Best Individuals Result (Hypertension).

                    training (%)   all data (%)   nodes   depths   generations
ADF-GP              89.5           66.3           41      …        18553
Apriori + ADF-GP    89.5           74.9           49      …        671



When the rules were converted from the decision tree, invalid rules and
meaningless rules were removed. The ratio of the number of effective rules to the
number of generated rules was 37.5% (by GP) and 50.0% (by the proposed method).
(Table 4 shows 3 rules generated with each technique, chosen by the highest
support value.)

By using GP, many invalid rules and many rules which were difficult to
interpret were generated. Compared to GP, the proposed method showed a decrease
in the support value and an improvement in accuracy. The proposed technique
improved the ratio of effective rules and the accuracy.
Table 4. Comparison of Generated Rules (Size and Inference Accuracy)

Technique           Size   Support Value (%)   Wrong (%)
ADF-GP              4      41.4                48.4
                    2      36.0                24.1
                    4      22.6                8.2
Apriori + ADF-GP    2      17.7                39.2
                    4      15.5                13.2
                    4      13.8                19.2



4 Concluding Remarks 

In this paper, we proposed a rule discovery technique for databases using genetic 
programming combined with association rule algorithms. 

In the future, we will study the following 4 topics related to the proposed
method. The first topic is to verify the method on other data. The second
topic is to further research the conversion algorithm from the association
rule to a decision tree with high accuracy. The third topic is to extend the
proposed method to multi-value classification problems. The fourth topic is to do a
theoretical analysis of the mechanism of overfitting.
1 Introduction

The purpose of a knowledge discovery system is to discover interesting patterns in
a given database. There exist many types of patterns, and this paper focuses on the
discovery of classification rules from a set of training instances represented by
attribute values and class labels. A classification rule restricts values of attributes
in its body and predicts a class of an instance that satisfies the body. Usually, a
body is a conjunction of conditions on attribute values. This paper deals with a
different type of rule whose body is a threshold function that requires at least m of
the n conditions in it to be satisfied. Such rules have much more representation
power than rules with conjunctive bodies and are suitable for many real world
problems such as diagnoses of diseases, in which observation of more symptoms
of a certain disease leads to a more confident diagnosis [3,8].

For m-of-n concepts, a small threshold for many conditions, i.e. small m for
large n, easily achieves large support. This property drastically increases the
number of rules with large support and high accuracy, and makes it difficult to
select interesting rules with fixed lower bounds of support and accuracy. This
paper extends the interestingness of classification rules proposed in [7] and tries
to resolve this problem by comparing a rule's support and accuracy with those of
simpler rules.



2 Classification Rules 

A classification rule is a rule that restricts values of attributes in its body and
classifies an instance that satisfies the body into the class in its head. Usually,
a body is a conjunction of conditions on attribute values and requires that all
conditions in it be satisfied. This paper deals with a different type of classification
rule such as

$R_1 : m\text{-of-}(a_1 = v_1, \ldots, a_n = v_n) \rightarrow class = c,$

where $a_i$ is an attribute and $v_i$ is one of its possible values. This rule means that
if an instance satisfies at least m of the n conditions $a_i = v_i$, $i = 1, \ldots, n$, then the
instance belongs to class c. For simplicity, we use $b_i$ to represent $a_i = v_i$ and
denote the body as $b_1 + \cdots + b_n \ge m$.
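
Checking an m-of-n body is a matter of counting satisfied conditions. A minimal sketch, with instances represented as dicts and hypothetical attribute names:

```python
def satisfies_m_of_n(instance, conditions, m):
    """True iff at least m of the (attribute, value) conditions hold for the instance."""
    satisfied = sum(1 for attribute, value in conditions
                    if instance.get(attribute) == value)
    return satisfied >= m

# Hypothetical 2-of-3 body over symptom attributes.
conditions = [("fever", "yes"), ("cough", "yes"), ("rash", "yes")]
patient = {"fever": "yes", "cough": "no", "rash": "yes"}
print(satisfies_m_of_n(patient, conditions, 2))
```

Setting m = n gives an ordinary conjunctive body, and m = 1 gives a disjunction, so the conjunctive rules discussed above are a special case.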
3 Evaluation of Rules

The most basic criteria to evaluate classification rules are support and accuracy.
Support is the probability that both a body and a head are satisfied, and
represents the generality of the rule to be evaluated. Accuracy is the conditional
probability that a head is satisfied on the condition that a body is satisfied, and
represents the reliability of a rule. Thus, it is important to discover rules with large
support and high accuracy. A simple way to find such rules is to give lower bounds of
support and accuracy and to select only rules whose support and accuracy are
higher than the lower bounds [1].
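
Both criteria can be estimated from training instances by simple counting. The sketch below assumes instances are dicts with a "class" key and that the rule body is passed as a predicate; these conventions are ours.

```python
def support_and_accuracy(instances, body, target):
    """Estimate support P(body and class = target) and accuracy P(class = target | body)."""
    n_body = 0
    n_both = 0
    for instance in instances:
        if body(instance):
            n_body += 1
            if instance["class"] == target:
                n_both += 1
    support = n_both / len(instances)
    accuracy = n_both / n_body if n_body else 0.0
    return support, accuracy

# Hypothetical training set with a single Boolean feature f.
instances = [
    {"f": 1, "class": "pos"},
    {"f": 1, "class": "neg"},
    {"f": 0, "class": "pos"},
    {"f": 1, "class": "pos"},
]
support, accuracy = support_and_accuracy(instances, lambda x: x["f"] == 1, "pos")
```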

However, fixed lower bounds are insufficient to select interesting rules and
sometimes accept boring rules [7]. Let's consider a rule $b_1 \rightarrow c$ and its
specialization $b_1 \wedge b_2 \rightarrow c$. If we already know the rule $b_1 \rightarrow c$ and its accuracy is higher
than that of $b_1 \wedge b_2 \rightarrow c$, then the second rule doesn't give us any new
information and is meaningless even if its accuracy is higher than the given lower bound.
This problem becomes serious for rules with m-of-n bodies because rules with a
small threshold for many conditions, i.e. small m for large n, easily achieve
large support, and typical lower bounds of support and accuracy tend to lead to a
huge number of rules.

The basic idea to resolve this problem is to dynamically set lower bounds of
support and accuracy by using support and accuracy of shorter rules. In other
words, we require that longer rules sufficiently improve support and/or accuracy
over shorter rules. To evaluate a rule with n conditions in its body,

$R_1 : b_1 + \cdots + b_n \ge m \rightarrow class = c,$

we use the following rules with n - 1 conditions in their bodies.

$R_2 : b_1 + \cdots + b_{i-1} + b_{i+1} + \cdots + b_n \ge m \rightarrow class = c,$
$R_3 : b_1 + \cdots + b_{i-1} + b_{i+1} + \cdots + b_n \ge m - 1 \rightarrow class = c.$

Because $R_1$ is a generalization of $R_2$ and a specialization of $R_3$, $R_1$ is not so
important if its support is not sufficiently higher than that of $R_2$ or its accuracy isn't
sufficiently higher than that of $R_3$. The problem is what “sufficiently higher” means.
Assuming independence between $b_i$ and $b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_n$, we can calculate
support and accuracy of $R_1$ from those of $R_2$ and $R_3$ as follows.

\begin{align*}
Sup^{(i)}_{exp}(R_1) &= P(c \wedge b_1 + \cdots + b_n \ge m) \\
 &= P(c \wedge b_1 + \cdots + b_{i-1} + b_{i+1} + \cdots + b_n \ge m) \\
 &\quad + P(c \wedge b_i \wedge b_1 + \cdots + b_{i-1} + b_{i+1} + \cdots + b_n = m - 1) \\
 &= Sup(R_2) + P(b_i|c) \left( Sup(R_3) - Sup(R_2) \right), \\
Acc^{(i)}_{exp}(R_1) &= P(c \,|\, b_1 + \cdots + b_n \ge m) \\
 &= \frac{Sup^{(i)}_{exp}(R_1)}{\dfrac{Sup(R_2)}{Acc(R_2)} + P(b_i) \left( \dfrac{Sup(R_3)}{Acc(R_3)} - \dfrac{Sup(R_2)}{Acc(R_2)} \right)}.
\end{align*}
Table 1. The number of m-of-3 rules with sufficiently high accuracy with respect
to constraint (2). We set minimum support and minimum accuracy to 2% and 90%
respectively.

database    $d_{acc} = -\infty$   $d_{acc} = 0\%$   $d_{acc} = 10\%$   $d_{acc} = 20\%$   $d_{acc} = 50\%$
krkp        3,086                 2,042             65                 8                  0
mushroom    85,770                63,257            4,716              2,009              89
soybean     22,853                18,413            1,899              1,149              96
voting      4,497                 1,873             178                103                21


Table 2. The number of m-of-3 rules with sufficiently large support with respect
to constraint (1). We set minimum support and minimum accuracy to 2% and 90%
respectively.

database    $d_{sup} = -\infty$   $d_{sup} = 0\%$   $d_{sup} = 1\%$   $d_{sup} = 2\%$   $d_{sup} = 5\%$
krkp        3,086                 1,175             101               13                0
mushroom    85,770                56,024            4,256             1,306             140
soybean     22,853                18,907            731               84                0
voting      4,497                 2,123             504               249               27


If the real support and/or accuracy of $R_1$ is comparable to or smaller than the above
values, then $R_1$ only shows regularity that can be easily expected from shorter
rules. This leads to the following constraints on interesting rules.

$Sup(R_1) > Sup^{(i)}_{exp}(R_1) + d_{sup}, \quad 1 \le \forall i \le n, \qquad (1)$

$Acc(R_1) > Acc^{(i)}_{exp}(R_1) + d_{acc}, \quad 1 \le \forall i \le n, \qquad (2)$

where $d_{sup}$ and $d_{acc}$ are domain dependent parameters.
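
The expected values and the two constraints can be sketched as follows. The code assumes the independence-based formulas of this section, with the coverage of a rule body obtained as support divided by accuracy; the function names, the tuple layout of `shorter` and the numeric inputs are hypothetical.

```python
def expected_support(sup2, sup3, p_bi_given_c):
    """Sup(R2) + P(b_i | c) * (Sup(R3) - Sup(R2))."""
    return sup2 + p_bi_given_c * (sup3 - sup2)

def expected_accuracy(sup2, acc2, sup3, acc3, p_bi, p_bi_given_c):
    """Expected support divided by the expected probability of the body of R1."""
    cover2 = sup2 / acc2            # P(body of R2)
    cover3 = sup3 / acc3            # P(body of R3)
    expected_cover = cover2 + p_bi * (cover3 - cover2)
    return expected_support(sup2, sup3, p_bi_given_c) / expected_cover

def passes_support_constraint(sup1, shorter, d_sup):
    """Constraint (1): must hold for every removed condition i.
    shorter: list of (sup2, acc2, sup3, acc3, p_bi, p_bi_given_c) tuples, one per i."""
    return all(sup1 > expected_support(s2, s3, pc) + d_sup
               for (s2, a2, s3, a3, p, pc) in shorter)

def passes_accuracy_constraint(acc1, shorter, d_acc):
    """Constraint (2): must hold for every removed condition i."""
    return all(acc1 > expected_accuracy(s2, a2, s3, a3, p, pc) + d_acc
               for (s2, a2, s3, a3, p, pc) in shorter)

# Hypothetical statistics for one removed condition b_i.
stats = (0.1, 1.0, 0.3, 0.75, 0.5, 0.5)
exp_sup = expected_support(0.1, 0.3, 0.5)
exp_acc = expected_accuracy(*stats)
```

With these inputs the expected support is 0.2 and the expected accuracy is 0.8, so a rule with observed accuracy 0.95 passes constraint (2) for $d_{acc} = 10\%$ but not for $d_{acc} = 20\%$.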



4 Experiments 

We applied the constraints (1) and (2) to databases in the UCI repository and
explored how many rules and what kind of rules were selected. Because of the lack
of space, we only report the results for m-of-3 rules. For all experiments, we
first enumerated m-of-3 rules whose support and accuracy were higher than 2%
and 90% respectively and filtered them with the constraints (1) and (2). Table
1 reports the number of rules when the accuracy constraint (2) was applied and
Table 2 reports the results with the support constraint (1). Even though the databases
used in the experiments are not so large and we focused only on short rules with
n = 3, there were thousands or more rules with 2% or larger support and 90%
or higher accuracy ($d_{acc} = -\infty$ in Table 1 and $d_{sup} = -\infty$ in Table 2). However,
we could select non-trivial rules from them by applying the constraints (1) or
(2) with appropriate values of $d_{acc}$ and $d_{sup}$.
5 Related Works

For discovery of rules, one of the most important problems is what kind of rules
should be discovered. Many previous works dealt with this problem and
proposed various criteria to filter out less interesting rules. One approach
is to discover typical rules in a given database that cover many instances and
classify them with high confidence [4]. Entropy-based criteria such as the J-measure [5]
are examples of criteria for typical rules. Another approach is to discover
exceptional rules. Because exceptional rules cover a relatively small number of
instances, a different type of criterion is required. Suzuki [6] proposed a
criterion to evaluate a pair of a typical rule and its exception based on the difference
in accuracy of the two rules. My previous work [7] proposed a criterion for rules
with conjunctive bodies based on the expected value of accuracy and showed that the
criterion could filter out trivial rules.

6 Summary 

This paper focused on discovery of classification rules whose bodies are m-of-n
concepts and proposed criteria to select only meaningful rules from a huge number
of rules with large support. The criteria compare the support and accuracy of a rule
to be evaluated with their expectations calculated from simpler rules.
Abstract. This paper discusses three issues in organizing a successful
knowledge discovery contest based on our experience with KDD Challenge
2000. KDD Challenge 2000 has been a success with its three unique
features: four preliminary contests, four data sets, and two program
committees. Based on this experience, we consider three issues mandatory for
successfully organizing a knowledge discovery contest: clear motivations
for the contest, support for domain experts, and promotion to attract
participants.



1 Introduction 

A knowledge discovery contest (KDC) is a systematic attempt to evaluate dis- 
covery methods of participants with a set of common data sets or common 
problems. The interest in KDCs has been increasing in the KDD (Knowledge 
Discovery in Databases) community. Major KDCs include KDD-Cup, Discov- 
ery Challenge [4], and KDD Challenge [3], which were held in conjunction with 
KDD conferences (International Conference on Knowledge Discovery & Data 
Mining), a PKDD conference (European Conference on Principles and Practice 
of Knowledge Discovery in Databases), and a PAKDD conference (Pacific-Asia
Conference on Knowledge Discovery and Data Mining) respectively. In addition, the
UCI KDD Archive [5] provides various benchmark problems of KDD with
domain knowledge, and can be considered a promising source of KDCs.

Since the goal of KDD can be summarized as extraction of useful
knowledge from a huge amount of data, a knowledge discovery (KD) system should
be evaluated by the usefulness of its output. A KDC provides a systematic
opportunity for such evaluation, and is thus highly important in KDD. However,
there are several issues in organizing a successful KDC. We here discuss such
issues based on our experience with KDD Challenge 2000 [3].

2 KDD Challenge 2000 

As a KDC, KDD Challenge 2000 provides three unique features. First, KDD
Challenge 2000 was held based on four domestic KDCs, each of which had been
organized by JSAI (Japanese Society for Artificial Intelligence), and was thus
carefully prepared. Second, KDD Challenge 2000 provided four data sets to its
challengers: diagnosis data on meningoencephalitis, bacteriological examination
data, treatment history data of patients with collagen diseases, and mutagenicity
data; it was thus richest in opportunities. These represent, respectively, a single
table-formatted data set, data sets with set attributes and numerous missing
values, highly irregular time-series data, and data sets with chemical structures.
Third, KDD Challenge 2000 was organized by a program committee
with data mining specialists and a supervisory board with domain experts,
and was thus supported by people with various backgrounds. Several organizers
are both data mining specialists and domain experts, and played an essential
role in the contest. The concerns of participants can be classified into
domain-specific problems such as mutagenic activity and data mining problems such as
rule discovery from graph structures. Professor J. M. Zytkow, who contributed
many results to automated discovery from the viewpoint of history of science, is
interested in promoting collaborative discovery among participants in a KDC.



3 Issues in a Knowledge Discovery Contest 



First, the motivations of a KDC should be clearly settled and announced. Such
motivations can be decomposed into academic benefits, which mainly represent
development of an effective KD method, and domain benefits, which mainly
represent discovery of useful knowledge. Evaluation of KD methods depends
on which motivation is emphasized. For example, suppose you have two KD
methods: one discovered 99 useful rules and 1 useless rule, and another discovered
1 incredibly useful rule and 99 useless rules. Motivation for academic benefits
favors the former method, while motivation for domain benefits favors the latter
method.

Second, measures to support domain experts are needed. Domain experts
can be reluctant due to several factors including noise in data and immaturity
of the domain. The target problem of a KDC should be appropriately settled
considering the evaluation process. Moreover, motivation for academic benefits
necessitates clarification of successes and failures with respect to characteristics
of a KD method. It is also desirable that the interestingness index be decomposed
into several indices such as validity, novelty, and usefulness. All of these require
considerable effort from domain experts, but improve the quality of a contest.

Third, increasing the number of participants is mandatory for the success of
a KDC. Currently, a participant should perform the whole KDD process [2]
alone, and suffers from its iterative and interactive aspects. We predict that a
future KDC will allow partial participation by either combining partial results
systematically or promoting collaboration among participants. The former
requires standardization of a KDD process, which is considered highly important
in KDD. The latter is being tried in Discovery Challenge 2000 [1], which
will be held in conjunction with PKDD-2000.
4 Conclusions

The goal of KDD can be summarized as the extraction of useful knowledge from 
a huge amount of data; thus a KD system cannot be evaluated solely by the 
given data. A KDC offers a systematic opportunity for subjective evaluation by 
domain experts, and is considered mandatory in KDD. This article discussed 
three issues from the viewpoint of organizing a successful KDC, based on our 
experience with KDD Challenge 2000. We should steadily continue to accomplish 
academic and domain achievements in the coming KDCs, and also extend the 
application domains to a broader range. 
1 Introduction 

Despite the growing popularity of semi-structured data such as Web documents 
and bibliography data, most data mining research has focused on databases 
containing well-structured data, such as RDBs or OODBs. In this paper, we try to 
find useful association rules from semi-structured data. However, some aspects 
of semi-structured data are not appropriate for data mining tasks. 

One problem is that semi-structured data contains some degree of irregularity 
and does not have a fixed schema known in advance. The lack of external schema 
information makes it very challenging to use standard database access 
methods or to apply rule mining algorithms. Therefore, schema discovering 
is considered necessary for rule mining. 

Another problem of association rule mining is computing cost. If the discovered 
schema pattern contains redundant attributes, they reduce mining efficiency. 
Therefore, we try to feed knowledge obtained from the resulting association 
rules back into schema discovering. This means that rule mining and schema 
discovering can benefit each other. In this way, by integrating knowledge from both 
rule mining and schema discovering, we can extract useful association rules from 
semi-structured data efficiently. 

2 Schema Discovering for Mining Association Rules 

2.1 Prototype-based Model for Semi-Structured Data 

To make semi-structured data easy to manipulate, we begin by representing 
it in a prototype-based model [1]. The prototype-based 
model was proposed in the field of object-oriented programming, but unlike the 
traditional object-oriented model it makes no distinction between classes and instances. 
Therefore, each object can have its own structure. Furthermore, slots, which store 
the attributes and data values, are evaluated dynamically, so the 
system need not know the type of data in advance. These features of the prototype-based 
model fit the features of semi-structured data. 

In our approach, we use BibTeX data as a kind of semi-structured data. 
The structures and patterns of such bibliography data depend on users or journals, 
so we can consider them semi-structured data. In our prototype-based model, 
each slot of a prototype object stores one attribute of a BibTeX entry, such 
as “title” or “author.” These attributes and their values contain some 
irregularities, but in the prototype-based model we need not know in advance what kind of data 
exists in each BibTeX entry. As shown in Fig. 1, we represent each 
BibTeX entry as a tree in order to apply the schema discovering algorithm. Detecting the 
most typical common structure during schema discovering helps us apply the 
association rule mining algorithm. 



@InProceedings{Object-ID, 
  author    = "A. SMITH and M.TOM", 
  title     = "Visualizing tool for ...", 
  booktitle = "International Conference On...", 
  editor    = "AAA & BBB ...", 
  publisher = "ACM Press, New York", 
  address   = "Menlo Park, CA, USA", 
  keyword   = "data mining, visualizing", 
  month     = "aug", 
  year      = "1996", 
  pages     = "214--219", 
  note      = "ftp://ftp /**.ps", 
  id        = "ML75", 
  list      = "MLO KDD", 
} 

[Tree representation: root node Proceedings with child nodes author (first, second), title, publisher, year, month, editor (first, second, third), page, and address (city, state, country).] 

Fig. 1. Example of BibTeX Data. 



2.2 Discovering Typical Schema Pattern 

In order to clean semi-structured data such as BibTeX data, we detect its most 
typical common structure using the idea behind Wang & Liu’s 
algorithm [2], which is called schema discovery. Once the most typical schema 
pattern is discovered, it filters out the redundant attributes that do not match 
the schema pattern. By storing all data in that pattern, we can translate 
it into cleaned, structured data. A summary of this process is as follows: 

Definitions 

• Let $p_i$ denote a path expression, that is, a representation of a path from the root 
node to a leaf node. A k-tree expression is a tree expression containing k 
leaf nodes and can be represented by a sequence $p_1 \ldots p_k$. 

• Consider a tree expression $te$. The support of $te$ is the number of root 
documents $d$ such that $te$ is “weaker than” $d$. Intuitively, if all structural 
information of $te_1$ is found in $te_2$, then $te_1$ is weaker than $te_2$. MINSUP 
denotes the user-specified minimum support, and $te$ is frequent if the support of 
$te$ is not less than MINSUP. $te$ is maximally frequent if $te$ is frequent 
and is not weaker than any other frequent tree expression. The discovery 
problem is to find all maximally frequent tree expressions. 

Algorithm 

1. MINSUP is specified by the user. 

2. All frequent 1-tree expressions, $F_1$, are found in the form of path 
expressions. 

3. Every frequent k-tree expression $p_1 \ldots p_{k-1} p_k$ is constructed from two 
frequent (k-1)-tree expressions $p_1 \ldots p_{k-2} p_{k-1}$ and $p_1 \ldots p_{k-2} p_k$. We represent 
all frequent k-tree expressions $p_1 \ldots p_{k-1} p_k$ as $F_k$. 

4. The actual frequent k-tree expressions in $F_k$ are found. This step prunes 
all non-maximal tree expressions. 




Fig. 2. Constructing (k+1)-frequent Tree Expression 

Fig. 2 shows the case in which the frequent 3-tree expression $p_1 p_2 p_3$ is constructed from 
two frequent 2-tree expressions $p_1 p_2$ and $p_1 p_3$. We can treat the final k-frequent 
tree expression as the most typical schema pattern. After we find it and clean 
the data, we can apply the association rule mining algorithm. 
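
The level-wise construction described above can be sketched in Python as follows. This is a minimal sketch under simplifying assumptions, not Wang & Liu's algorithm itself: each document is flattened to a set of path expressions, a tree expression is a sorted tuple of paths, and "weaker than" is approximated by set inclusion. The function name `maximal_frequent_path_sets` and the sample paths are hypothetical.

```python
from itertools import combinations


def maximal_frequent_path_sets(documents, minsup):
    """Return maximal frequent sets of paths (simplified tree expressions).

    documents: list of sets of path strings. Assumption: structure is
    flattened to paths, so "weaker than" becomes the subset relation.
    """
    def support(paths):
        return sum(1 for doc in documents if set(paths) <= doc)

    # Step 2: frequent 1-tree expressions F_1.
    all_paths = sorted(set().union(*documents))
    level = [(p,) for p in all_paths if support((p,)) >= minsup]
    frequent = list(level)

    # Steps 3-4: join two (k-1)-expressions that share their first k-2 paths,
    # then keep only the candidates that are actually frequent.
    while level:
        candidates = set()
        for a, b in combinations(level, 2):  # level is sorted, so a < b
            if a[:-1] == b[:-1]:
                candidates.add(a + (b[-1],))
        level = sorted(c for c in candidates if support(c) >= minsup)
        frequent.extend(level)

    # Keep only expressions that are not contained in a larger frequent one.
    return [s for s in frequent
            if not any(set(s) < set(t) for t in frequent)]


docs = [
    {"author", "title", "year"},
    {"author", "title", "year", "note"},
    {"author", "title"},
]
print(maximal_frequent_path_sets(docs, minsup=2))
```

With a minimum support of 2, the only maximal frequent expression in this toy data is the combination of "author", "title", and "year"; the rarely used "note" path is filtered out, which is the cleaning effect the section describes.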

2.3 Mining Association Rules with Concept Hierarchy 

Association rules are powerful abstractions for understanding a large amount of 
data by finding interesting regularities that satisfy two given thresholds, 
support and confidence. However, one reason why useful association rules are 
often unavailable is that rules are generated with no background knowledge. In 
order to find more interesting and informative rules, we can use 
concept hierarchies as a representation of background knowledge [3]. 

In the case of knowledge discovery from BibTeX data, the most useful 
attribute is the “Title” attribute. It is considered to contain important words that 
denote the author’s main interest. So we associate each word in “Title” with a 
concept in the concept hierarchy, and generate not only the original association rules 
but also their child rules, which contain their child nodes in the concept hierarchy. 

Fig. 3 shows an example of generated child rules. These child rules can tell 
who is the main researcher in each field defined as a child of “data mining.” Such 
a tendency cannot be discovered by original rules alone. 




Fig. 3. Discovered Rule with Concept Hierarchy 









Kohei Maruyama and Kuniaki Uehara 



As another example, assume that the following rules are found using the 
dataset we obtained from a bibliography search engine. 

— { Journal = Lecture Note in C.S., Year = 1998 } => { Title = data mining } 

— { Journal = Lecture Note in C.S., Year = 1999 } => { Title = data mining } 

From these rules, we learn that papers on data mining have been written steadily 
in recent years, but we cannot discover any further knowledge from them. 
By using the concept hierarchy, however, we can discover child rules, as shown in Fig. 4. 



Original rules: 

  journal               year   title 
  Lecture Note in C.S.  1998   Data Mining 
  Lecture Note in C.S.  1999   Data Mining 

Child rules: 

  journal               year   title 
  Lecture Note in C.S.  1998   Association Rule (37.5%) 
  Lecture Note in C.S.  1999   Visualizing (28.6%) 

Fig. 4. Discovered Rules and Their Graph 

The generated rules show that in “1998” some papers about “association rule” 
in the data mining field were written, whereas in “1999” papers on “visualizing” 
appear several times while there is no paper on “association rule.” This means 
that the target of interest shifted from “association rule” to “visualizing” 
within the same “data mining” field. 
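
As an illustration of how child rules could be derived, the following sketch computes the confidence of a parent rule and of its child rules from a small concept hierarchy. It is only a sketch under stated assumptions: the hierarchy `PARENT`, the record layout with a single "topic" per paper, the function names, and the sample records are all hypothetical, and the percentages in Fig. 4 are not reproduced.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
PARENT = {"association rule": "data mining", "visualizing": "data mining"}


def is_a(topic, concept):
    """True if topic equals concept or is a descendant of it."""
    while topic is not None:
        if topic == concept:
            return True
        topic = PARENT.get(topic)
    return False


def confidence(records, antecedent, concept):
    """Confidence of the rule antecedent => { Title = concept }."""
    matched = [r for r in records
               if all(r[key] == value for key, value in antecedent.items())]
    if not matched:
        return 0.0
    return sum(is_a(r["topic"], concept) for r in matched) / len(matched)


def child_rules(records, antecedent, concept):
    """Confidence of the same rule for every child of concept."""
    children = [c for c, p in PARENT.items() if p == concept]
    return {c: confidence(records, antecedent, c) for c in children}


records = [
    {"year": 1998, "topic": "association rule"},
    {"year": 1998, "topic": "association rule"},
    {"year": 1998, "topic": "visualizing"},
    {"year": 1998, "topic": "databases"},
]
print(confidence(records, {"year": 1998}, "data mining"))
print(child_rules(records, {"year": 1998}, "data mining"))
```

In this toy data the parent rule holds for three of four papers, while the child rules split that mass between "association rule" and "visualizing", which is the kind of finer tendency the child rules are meant to expose.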

As shown in Fig. 4, by using the concept hierarchy we can generate more 
informative rules from the discovered original rules. In this way, we may find important 
keywords that have appeared frequently in recent years but were not discovered because of 
the size of the itemsets we chose or because the support or confidence threshold was too high. 
One of our purposes is to find these hidden informative rules behind 
discovered rules that seem meaningless at first sight, and to discover knowledge such as 
the tendencies of a particular researcher, research field, and so on. 



3 Feedback from Rule Mining to Schema Discovering 

As mentioned above and in our previous paper [4], schema discovering helps the 
task of rule mining from semi-structured data, and we were able to discover some useful 
rules. One problem is that the algorithm must compute all combinations 
of items, which is critical for mining efficiency. Therefore, if we can remove 
in advance useless attributes that do not appear in the mining result, we will 
be able to carry out rule mining efficiently when the next mining is needed, 
for example after a database update. We therefore propose feeding the 
knowledge obtained from rule mining back to improve the schema pattern. Fig. 5 shows an 
outline of how this knowledge integration works. 

Fig. 5. Outline of Our Approach for Knowledge Integration. 

In Fig. 5, the left tree is a typical schema pattern discovered by schema discovering. 
After refining the BibTeX data by storing all of it in this pattern, we extract association 
rules. If the discovered rules do not include data values of some attribute, that attribute can be 
pruned from the schema pattern to make it less redundant. For example, the 
“address.city” attribute is discovered in the typical schema pattern, but it does not 
appear in the extracted association rules. Therefore, it can be pruned. In this 
way, we can generate an improved schema pattern and use it for the next mining. By 
using a less redundant schema pattern, we can generate new rules efficiently. That 
is, rule mining and schema discovering can benefit each other 
when handling semi-structured data. 
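
The feedback step can be sketched as a simple filter over the schema attributes. This is a minimal sketch, assuming that a rule is represented as a pair of dictionaries (antecedent, consequent) keyed by attribute path; the function name `prune_schema` and the sample schema and rule are hypothetical.

```python
def prune_schema(schema_attributes, rules):
    """Keep only the schema attributes that occur in some discovered rule.

    rules: list of (antecedent, consequent) pairs, each a dict that maps an
    attribute path to a value (assumed representation).
    """
    used = set()
    for antecedent, consequent in rules:
        used.update(antecedent)   # adds the attribute names (dict keys)
        used.update(consequent)
    return [a for a in schema_attributes if a in used]


schema = ["author", "title", "journal", "year", "address.city"]
rules = [
    ({"journal": "Lecture Note in C.S.", "year": "1998"},
     {"title": "data mining"}),
]
print(prune_schema(schema, rules))
```

Here "address.city" is dropped because no rule mentions it. Note that "author" is dropped as well in this toy run, so in practice one would probably prune only after enough rules have been collected.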

Some researchers try to feed the benefit of data mining back to an information 
extraction system with a similar approach [5]. They try to improve bottom-up 
information extraction, which is considered appropriate for incremental retrieval 
of data. We, on the other hand, try to improve top-down schema discovering, 
which is appropriate for efficient mining of large databases. Our view 
is that computational cost must be a main concern in data mining, but we 
also have to examine the incremental features of semi-structured data. 



4 Conclusion and Future Works 

In order to manipulate semi-structured data such as bibliography data, we adopt 
a prototype-based model and clean the data by schema discovering. We also 
generate child rules to obtain more informative rules. Furthermore, we propose 
that the discovered rules be reused to improve the schema pattern. This is the 
knowledge integration in our case. 

As future work, we have to examine the efficiency of our approach and its use with 
other semi-structured data such as XML data. Furthermore, we plan to examine our 
approach from the viewpoint of the information extraction field. 
1 Introduction 

Human motion data is used in practice in domains such as movies and 
computer graphics (CG). Creators use motion data to produce exciting and dangerous scenes 
without real actors. Human motion data has the following features: 

1. Correlation between body parts 

We control all body parts in cooperation, and motion data contains 
information about this cooperation. For example, we swing both arms in turn to 
keep walking straight. There is a correlation between arms and legs. 

2. Correlation in the flow of contents 

A certain movement tends to occur after another movement. 
For example, once we raise our hands, we will probably put them down 
within a certain interval of time. This is a correlation in the time flow of motion. 

Because human motion data has these features, motion data should 
be treated as a multi-stream [4]. A multi-stream includes unexpectedly frequent or 
infrequent co-occurrences among different streams. This means that an event on 
one stream is related to other events that are located on other streams and seem 
to have nothing to do with the former event. Time series patterns of stock prices 
are a good example. The rise and fall of the prices of some stocks cause the price of 
another stock to rise and fall. If we analyze the multi-stream of time series for some 
stock prices and can discover correlations between the streams, the correlations help 
us decide a better time to buy stocks. 

Correlations discovered from the multi-stream of human motion 
characterize specific motion data. Furthermore, those correlations become basic 
elements that can be combined to construct motion, 
just as phonemes do for the human voice. We call those basic elements primitive 
motions. As a result, we can use primitive motions as indices to retrieve and 
recognize motion. 

We introduce a method that finds correlations based on the contents of motion 
and converts motion data into combinations of primitive motions. Sets of primitive 
motions are used as indices to retrieve and recognize motion. The discovered 
correlations are visually comprehensible indices for a multi-stream of motion, 
because they are found according to the contents of the multi-stream. Therefore, these 
correlations can be used effectively as indices for multi-streams of human motion. 
2 Human motion data as multi-stream 

Motion data captured by a motion capture system is a kind of multi-stream 
data. The data is a mass of information about body parts: motion 
capture data consists of streams of 3-D time series patterns representing the 
positions of major body joints. The data is used mainly in the entertainment and 
research domains. In movies and CG, captured data is used effectively to create 
scenes that real actors could never play. 




Fig. 1. The motion capture system 



The multi-stream of motion includes the correlations discussed in Sec. 1. Fig. 2 
shows an example of motion data in which, after raising the 
right hand, the performer starts putting the left hand down. 




[Figure panels: position (height) data of the right hand; position (height) data of the left hand.] 



Fig. 2. An example of correlation in multi-stream of human motion. 



The example shows that motion data has the following features as a multi-stream: 

— temporal distance on each stream 

Consistent contents on each stream do not always occur at a fixed temporal 
distance. In Fig. 2, “Raising right hand” and “Putting left hand down” occur 
twice in each stream; however, the temporal distance between the occurrences of 
“Raising right hand” and “Putting left hand down” on the left side differs from 
that on the right side. These distances always vary. 

— period among contents 

Segments whose content characterizes the motion do not always occur with a fixed 
period. In Fig. 2, “Raising right hand” and “Putting left hand down” occur 
twice in the streams with a period of a certain length, which is a characteristic 
of the motion data; however, the period is not always the same. It also 
varies. 

A personal characteristic of the actor in the performance can cause this variety, 
and it is impossible to prevent it from occurring. In order to find 
correlations with a simple analysis that takes the variety into account, we convert 
motion data into strings of characters that symbolize contents and process those 
strings as patterns of characters. By adjusting the size of the patterns, we can 
find patterns while reducing the effect of the variety. The details are given in Sec. 3. 

3 Primitive motion: the index for human motion 

Our system converts the multi-stream of motion into a symbol multi-stream in order 
to reduce the running time of the analysis algorithm on a huge amount of data such as motion 
data. The symbols should: 

— be given to segments of the motion multi-stream with consistent content 

— not be given by hand, because the way each person describes human motion 
depends heavily on his/her personal sense. 

First, our system performs this conversion automatically; we call it 
content-based automatic symbolization [2]. Next, correlations are discovered from the symbol 
multi-stream for body parts, and combinations of them are used as indices for 
motion. We call these correlations primitive motions. 



3.1 Content-based automatic symbolization 

For content-based automatic symbolization, we focus on the observation that a change of 
content and a change of velocity in each body part happen simultaneously. 
The motion data is divided into segments [5] at the points where the velocity 
changes. The detected points are candidates for the border points of segments 
with consistent content. However, the detected points include noise caused by 
unconscious movement, which has nothing to do with changes of content. 
Unconscious movement is mainly caused by vibration of body parts. The 
noise is removed by considering the 3-D distances between points. 

The segmented data is clustered into groups according to similarity. However, 
even segments with the same content have different time lengths, because 
nobody can act in exactly the same manner twice. In order to calculate the similarity 
between time series patterns of different lengths, we employ Dynamic Time 
Warping (DTW) [1], which was developed in the speech recognition domain. DTW 
calculates the similarity between patterns while absorbing differences on the time 
scale. 
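
To make the role of DTW concrete, the following is a minimal sketch of the textbook dynamic-programming formulation for two sequences of 3-D positions. It is not taken from the system described here: the function name `dtw_distance`, the Euclidean local cost, and the absence of any warping-window constraint are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of 3-D points.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j]; each step may
    advance in a, in b, or in both, which absorbs differences in timing.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The same upward movement performed at two different speeds.
fast = [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
slow = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 0, 2)]
print(dtw_distance(fast, slow))                    # 0.0
print(dtw_distance(fast, [(0, 0, 0), (0, 0, 2)]))  # 1.0
```

The first call returns zero because the slower segment visits the same positions and only lingers on them, which is exactly the difference in time scale that DTW is meant to ignore.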

The human voice has a fixed number of consistent contents, namely phonemes, but human 
motion does not have pre-defined patterns, so it is unknown how many 
consistent contents exist. For this reason, we use a simple and powerful 
unsupervised clustering algorithm, the Nearest Neighbor (NN) algorithm [3]. The system 
clusters segments by using the NN algorithm with DTW as its evaluation function, 
and motion data is automatically converted into symbol streams based on its 
content by using the symbols that are given to the clusters. 
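
The text does not spell out the NN algorithm of [3], so the sketch below uses one common threshold-based formulation as an assumption: each segment joins the cluster of its nearest already-seen segment if that neighbor is within a threshold, and otherwise starts a new cluster. The names `nn_cluster` and `to_symbols`, the threshold value, and the one-dimensional toy "segments" are hypothetical; any distance function, for instance a DTW distance between segments, could be passed in.

```python
def nn_cluster(items, distance, threshold):
    """Threshold-based nearest-neighbor clustering (assumed formulation).

    Returns one integer cluster label per item, in input order. The number
    of clusters is not fixed in advance; it follows from the threshold.
    """
    labels = []
    for i, item in enumerate(items):
        best_j, best_d = None, None
        for j in range(i):
            d = distance(item, items[j])
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d <= threshold:
            labels.append(labels[best_j])            # join neighbor's cluster
        else:
            labels.append(max(labels, default=-1) + 1)  # open a new cluster
    return labels


def to_symbols(labels):
    """Map cluster labels 0, 1, 2, ... to the symbols A, B, C, ..."""
    return "".join(chr(ord("A") + label) for label in labels)


segments = [0.0, 0.1, 5.0, 5.2, 0.05]  # toy stand-ins for motion segments
labels = nn_cluster(segments, lambda x, y: abs(x - y), threshold=1.0)
print(labels)              # [0, 0, 1, 1, 0]
print(to_symbols(labels))  # AABBA
```

The resulting string plays the role of one symbol stream; running the same procedure per body part would give the symbol multi-stream.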



3.2 Correlations in symbol multi-stream 

Correlations of contents exist in the flow of time, so the symbol multi-stream is 
synchronized on the time scale as shown in Fig. 3 (a). 
Fig. 3. The process of multi-stream. 



Discovered primitive motions show the frequency of co-occurrence of two symbol 
patterns. That is, the system calculates the probability that a pattern 
B occurs a certain number of blocks after a pattern A. See the right-hand 
side of Fig. 3 (a). Our system finds the pattern A, which occurs more often 
than others and includes the symbols “B”, “B”, “L”, “M” in the two streams for the right 
hand and elbow. Two blocks of symbols after the beginning of A, the 
system finds a pattern B of the same kind, which consists of “U”, “U”, “G”, “G” in the 
streams for the left hand and elbow. The combination of A and B occurs 
frequently in the symbol multi-stream, which means “contents “U” and “G” occur 
with high probability after contents “B”, “L” and “M” occur in the motion,” 
because the symbols represent contents of motion (Fig. 3 (b)). 

In order to decrease the influence of the variety discussed in Sec. 2, 
we set the sizes of patterns A ($A_{size}$) and B ($B_{size}$) and the interval between A and B 
($int$). These sizes are flexible, and this flexibility allows the system to find 
patterns A and B with the same interval whether the sizes are set to $A_{size} = 2$, 
$B_{size} = 2$ and $int = 2$, or to $A_{size} = 3$, $B_{size} = 3$ and $int = 1$. This flexibility 
decreases the influence of the variety. 
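
The probability of a pattern B following a pattern A can be sketched as below. This is an illustrative sketch, not the system's implementation: the stream names, the toy symbol strings, and the function names are hypothetical, all streams are assumed to have the same length, and the interval is assumed to be measured from the first block of A to the first block of B (the exact convention used in the system may differ).

```python
from collections import Counter


def window(streams, names, start, size):
    """Symbols of the named streams in blocks [start, start + size)."""
    length = len(streams[names[0]])
    if start < 0 or start + size > length:
        return None
    return tuple(streams[name][start:start + size] for name in names)


def follow_probabilities(streams, a_names, b_names, a_size, b_size, interval):
    """Estimate P(B starts `interval` blocks after the start of A | A)."""
    a_counts, ab_counts = Counter(), Counter()
    length = len(streams[a_names[0]])
    for t in range(length):
        a = window(streams, a_names, t, a_size)
        b = window(streams, b_names, t + interval, b_size)
        if a is None or b is None:
            continue  # only positions where both windows fit are counted
        a_counts[a] += 1
        ab_counts[(a, b)] += 1
    return {pair: n / a_counts[pair[0]] for pair, n in ab_counts.items()}


streams = {
    "right_hand":  "BBXXBBXX",
    "right_elbow": "LMXXLMXX",
    "left_hand":   "XXUUXXUU",
    "left_elbow":  "XXGGXXGG",
}
probs = follow_probabilities(
    streams,
    a_names=("right_hand", "right_elbow"),
    b_names=("left_hand", "left_elbow"),
    a_size=2, b_size=2, interval=2,
)
print(probs[(("BB", "LM"), ("UU", "GG"))])  # 1.0
```

In this toy multi-stream the right-arm pattern is always followed two blocks later by the left-arm pattern, so the estimated probability is 1. Changing `a_size`, `b_size`, and `interval` corresponds to the flexibility discussed above.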

The system discovers many correlations from a motion multi-stream, and each 
of them characterizes the motion with a certain degree of strength. Probability is 
an influential parameter for defining this strength and for selecting correlations as 
primitive motions for the motion. However, the definition of strength still needs 
careful consideration. 



4 Experiments 

We prepared data for 8 kinds of traditional Japanese physical exercise as test data to 
discover primitive motions. Each motion consists of 2 or 3 repetitions of 
one kind of exercise. The test data are about 10 to 20 seconds long, and the sampling frequency 
is 120 samples per second. The main movement in the test data is concentrated in both 
arms, so we ran experiments on 4 streams for the following body parts: the right and 
left hands, and the right and left elbows. 

The system discovered primitive motions mainly in the streams of the right and left 
elbows. This is because the elbows have fewer clusters than the hands: 
the elbows have about 100 clusters and the hands have about 200 clusters. 
The difference means that the hands can express motion with wider variety 
than the elbows. The system discovered primitive motions such as: a) the left elbow is 
in mid-motion after the right elbow starts going down; b) after the right elbow stops 
going up, the left elbow starts going down; and so on. These primitives occurred more than 
40 to 50 times, and their probability of occurrence is from 0.3 to 0.4. 

Some other primitive motions kept a probability as high as 0.3, but they 
occurred fewer than 40 to 50 times. This shows that primitive motions 
with both a higher probability and a higher frequency of occurrence than others 
should be given to the motion as indices. It is difficult to express the features of 
motion exactly in words alone; however, the discovered correlations were visually 
understandable. This means that our method is effective for discovering indices for 
multi-streams of motion. 

5 Conclusion 

We proposed a method that finds correlations between body parts and 
between contents in the time flow of motion. These correlations appear many times 
in the multi-stream, and motion data is represented by combinations of 
correlations. Correlations are basic elements of motion, and we call them primitive 
motions. Primitive motions can serve as indices for motion retrieval and recognition. 

In the motion recognition domain, many researchers have focused on the 
extraction of human motion information from time series of 2-D images. The extracted 
information does not possess correlations, and they do not process the 
information as a multi-stream. From this point of view, our method is more effective for 
recognizing and retrieving human motion data. 
Abstract. In general, pre-processing for data mining 
is considered a necessary technique for removing irrelevant and meaningless aspects 
of data before applying data mining algorithms. From this viewpoint, we 
have considered pre-processing for detecting a decision tree, and have already 
proposed the notion of Information Theoretical Abstraction and 
implemented a system, ITA. Given a relational database and a family of 
possible abstractions for its attribute values, called an abstraction hierarchy, 
ITA selects the best abstraction among the possible ones 
so that the class distributions needed to perform the classification task are 
preserved, and generalizes the database according to the best abstraction. 
According to our previous experiment, just one application of 
abstraction to the whole database is effective in reducing the 
size of the detected decision tree without making the classification 
accuracy worse. However, since classification systems such as C4.5 perform 
attribute selection repeatedly, ITA does not generally guarantee 
that class distributions are preserved, given a sequence of attribute 
selections. For this reason, in this paper we propose a new version of 
ITA, called iterative ITA, which tries to preserve the class distributions 
in each attribute selection step as far as possible. 

1 Introduction 

Many studies on knowledge discovery in databases (KDD) have concentrated on 
developing data mining algorithms for detecting useful rules from very large 
databases effectively. However, the detected rules include meaningless 
rules as well as meaningful ones. Thus, pre-processing for data mining is 
needed to exclude irrelevant and meaningless rules. There exist some techniques 
commonly used in pre-processing [1]. For instance, the attribute-oriented 
induction used in DBMiner [2] is a powerful technique not only for preventing the 
mining task from extracting meaningless rules but also for making the detected rules 
more understandable. 

We have already developed a system, ITA (Information Theoretical 
Abstraction) [3], based on attribute-oriented induction and information theory. 
ITA is based on the idea that the information gain ratio used in C4.5 [4] can 
also be applied to determine which abstraction preserves the necessary 
information in attribute-oriented induction [2]. Given a relational database and a 
set of possible abstractions for its attribute values, ITA selects an appropriate 
abstraction among the possible ones and generalizes the database according to the 
selected abstraction in order to reduce its size. An abstraction is said to 
be appropriate if, for given target classes, the class distribution in the original 
database is preserved in the result of generalization as far as 
possible. Therefore, we can obtain a compact (generalized) database that still has an 
abstract class distribution close to the original one. In our previous work, the 
original database $D$ is first generalized according to an appropriate abstraction, 
and then the generalized database $D'$ is input to C4.5 to construct a decision tree $DT'$ 
[3]. It has already been shown empirically that the classification error of $DT'$ is almost 
the same as that of the decision tree $DT$ directly computed by C4.5 from the original 
database $D$. Nevertheless, the size of $DT'$ is drastically reduced compared with 
$DT$. 

Thus, just one application of abstraction to the whole database has been shown 
experimentally to be effective in reducing the size of the detected rules without 
making the classification error worse. However, as C4.5 performs attribute 
selection repeatedly, ITA does not generally guarantee that the 
class distribution is preserved in each selection step. Hence, when we require 
the classification accuracy of $DT'$ to be almost equal to that of $DT$, we cannot allow the 
classification accuracy to go down even slightly. For this reason, in this paper 
we propose a new version of ITA, called iterative ITA, which tries to preserve the 
class distribution in each attribute selection step as far as possible. 

As a result, the precision of the detected rules becomes much closer to that 
of C4.5, while the size of the detected rules is still reduced 
as with non-iterative ITA. 

2 Iterative ITA 

Figure 2.1 illustrates the generalization process in iterative ITA. Iterative ITA 
selects an appropriate abstraction in each attribute selection step and constructs 
a compact decision tree, called an abstract decision tree. That is, we propose to 
perform our generalization process in each attribute selection step of C4.5, in which 
the attribute used to expand the decision tree is selected. Each node $N_j$ in the tree 
has a corresponding sub-database $D_{N_j}$ of the original $D$, obtained by selecting 
tuples. For such a sub-database $D_{N_j}$, C4.5 selects another attribute $A_i$ to 
expand the tree further. We try to find an appropriate abstraction $\varphi$ for 
$A_i$ so that the target class distribution given the $A_i$ values is preserved 
even after generalizing each $A_i$ value $v$ to a more abstract value $\varphi(v) = d_k$. The 
generalized database is denoted by $D'_{N_j} = \varphi(D_{N_j})$. Iterative ITA has the 
following features. 

— The process of constructing the decision tree in C4.5 divides the current node 
corresponding to $D_{N_j}$ according to all attribute values $\{v_1^1, \ldots, v_1^{n_1}, \ldots, v_m^1, \ldots, v_m^{n_m}\}$. 
On the other hand, iterative ITA divides the current node 
according to all abstract concepts in the grouping 
$\{\{v_1^1, \ldots, v_1^{n_1}\}, \ldots, \{v_m^1, \ldots, v_m^{n_m}\}\} = \{d_1, \ldots, d_m\} = \varphi$ selected in each attribute selection step. 
[Figure: an original decision tree (left) and an abstract decision tree (right), obtained by replacing attribute values with abstract concepts based on an appropriate abstraction.] 




Fig. 2.1. An effect of the abstraction for the decision tree 



Therefore, the number of branches is m, and iterative ITA reduces the number of 
branches compared with the original decision tree constructed by C4.5. 

Roughly speaking, the condition for terminating the expansion 
of the decision tree concerns whether the expected 
classification error rate in the child nodes is small compared with that of 
the current node. The process of constructing the decision tree in iterative 
ITA is similar to that in C4.5. Suppose that the expected classification error rate in 
some child node is smaller than that of the current node, and that this relation between 
the child node and the current node still holds after abstraction. Then the 
classification accuracy in $D_{N_j}$ is improved. In such a case, iterative ITA 
continues constructing the decision tree. Otherwise, it terminates 
the expansion process. Since the expected classification error rate at the abstract 
level is the average of those at the concrete level, the error rate at the abstract 
level turns out to be larger than the error rate at the concrete level. In other 
words, a chance of improving the classification accuracy is lost by applying 
abstraction. 

Furthermore, suppose that the expected classification error rate in some 
child node is larger than that of the current node. This means that the condition 
for terminating the expansion process holds. Then, for any abstraction, the 
expected classification error rate at the abstract level is also larger than the error 
rate at the concrete level. In this case, iterative ITA terminates the expansion 
process. This is again because the expected classification error rate at the 
abstract level is defined as the average of the error rates at the concrete level. As 
a result, the stopping condition for expansion at the abstract level 
is satisfied whenever it is satisfied at the concrete level, and the condition at the abstract 
level tends to hold earlier than at the concrete level. 
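
One way to read this argument is illustrated by the small sketch below. It assumes, for illustration only, that the error rate of a node is the fraction of its tuples outside the majority class (C4.5 itself uses a pessimistic estimate, so this is a simplification), and it compares the size-weighted average error of the concrete child nodes with the error of the single node obtained by merging them under one abstract concept. The function names and class counts are hypothetical.

```python
from collections import Counter


def error_rate(class_counts):
    """Fraction of tuples that do not belong to the majority class."""
    total = sum(class_counts.values())
    return 1.0 - max(class_counts.values()) / total


def compare_levels(children):
    """Return (concrete, abstract) error rates for a group of child nodes.

    concrete: size-weighted average of the children's error rates.
    abstract: error rate after merging the children into one node.
    """
    total = sum(sum(c.values()) for c in children)
    concrete = sum(sum(c.values()) / total * error_rate(c) for c in children)
    merged = Counter()
    for c in children:
        merged.update(c)  # adds the class counts of each child
    return concrete, error_rate(merged)


# Two attribute values that an abstraction would group together.
children = [{"yes": 8, "no": 2}, {"yes": 3, "no": 7}]
concrete, abstract = compare_levels(children)
print(concrete, abstract)  # roughly 0.25 and 0.45
```

Under this simplified error measure the merged node can never have a lower error than the weighted average of its parts, because the majority count of the merged node is at most the sum of the children's majority counts. This matches the observation that the stopping condition tends to hold earlier at the abstract level.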
[Figure panels, each plotted against the change ratio: (a) the size of decision trees; (b) the error rate of decision trees.] 

Fig. 3.2. Sizes and error rates of abstract decision tree 

3 Experiment on Census Database 

We have carried out experiments using the iterative ITA system. In our 
experiments, we try to generate decision trees from the US Census Bureau 
Census Database found in the UCI repository. Iterative ITA generates various compact 
decision trees, called abstract decision trees, by adjusting a threshold on the change 
ratio. The change ratio is the ratio of the information gain ratio after applying 
generalization to that before applying generalization. We compare our abstract 
decision trees with a decision tree generated by C4.5. The sizes and error rates of 
these decision trees are shown in Figure 3.2. 
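
The change ratio can be illustrated with the standard definitions of entropy and information gain ratio. The sketch below is a minimal illustration rather than the ITA implementation: the function names, the toy attribute values, and the abstraction mapping are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, classes):
    """Information gain ratio of an attribute with respect to the classes."""
    total = len(values)
    groups = {}
    for v, c in zip(values, classes):
        groups.setdefault(v, []).append(c)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(classes) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


def change_ratio(values, classes, abstraction):
    """Gain ratio after generalization divided by gain ratio before it."""
    before = gain_ratio(values, classes)
    after = gain_ratio([abstraction[v] for v in values], classes)
    return after / before if before > 0 else 0.0


values = ["teacher", "professor", "clerk", "cashier"]
classes = ["high", "high", "low", "low"]
abstraction = {"teacher": "education", "professor": "education",
               "clerk": "service", "cashier": "service"}
print(gain_ratio(values, classes))                 # 0.5
print(change_ratio(values, classes, abstraction))  # 2.0
```

In this toy case the abstraction keeps the class distribution intact while halving the split information, so the gain ratio doubles. An abstraction that mixed "high" and "low" tuples in one concept would instead lower the ratio, and a threshold on the change ratio would reject it.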

From this observation, we consider iterative ITA useful for constructing 
a compact decision tree whose error rate is approximately equal to that 
before generalization: in the best case, the size decreased drastically from about 6000 to 
about 1200, at the cost of a slight increase in error rate from 0.132 to 0.167. 

4 Conclusion 

We consider that the generalization method used in iterative ITA is useful for
generating a very compact abstract decision tree, which is more understandable
for most users, and whose increase in error rate is minimized among a given
class of abstractions.

1 Introduction

Linkage information has been shown to be useful for finding good Web pages
with a search engine [5, 10]. In general, however, a search result contains
several topics. Clustering Web pages enables a user to browse them easily, and
there are several works on clustering Web pages [7, 10-12]. In [9], we
visualized Web graphs using a spring model. But clustering alone is not enough
to understand the topics of the clusters. Extraction of meta-data that explains
communities is an important subject. Chakrabarti et al. [6] used the terms in
the small neighborhood around a document. Our approach is to combine clustering
and keyword extraction to interpret the communities.

To find communities, we solve the eigensystem of the matrix made from the
link structure of Web pages. To get characteristic keywords from the
communities found, we use the algorithm developed in [1,2,8]. The input to the
algorithm is two sets of documents: positive and negative documents. The
algorithm outputs a pattern which classifies them well. This algorithm is
robust against errors and noise, so it is suitable for Web pages. The novelty
of the keyword extraction algorithm is that keywords not only characterize one
community but also distinguish the community from others. Thus, even if we fix
a community, we obtain different characteristic keywords for the community
depending on its counterpart.

We found good characteristic keywords from two communities without looking at
the Web pages in them. We also show an experimental result in which different
keywords are extracted depending on the counterpart.

2 Preliminaries 

Our method has three basic steps. First, we collect a large number of Web
pages. The second step is to find communities by solving the eigensystem of the
matrix made from the link structure of the collected pages. Kleinberg [10]
showed some communities extracted from eigenvectors. The final step is to find
characteristic keywords from the communities found in the previous step.

First, we define a Web graph and its matrix representation. A Web graph
$G = (V, E)$ is a directed graph such that (1) each node $v \in V$ is labeled
with a URL $u$ ($label(v) = u$), and (2) there exists an edge $(v_1, v_2) \in E$
if and only if there exists a link from $label(v_1)$ to $label(v_2)$. The
adjacency matrix $M_G = (m_{ij})$ for a Web graph $G$ is defined by $m_{ij} = 1$
if $(v_i, v_j) \in E$ and $m_{ij} = 0$ otherwise. For a matrix $M$, $M^{*}$
denotes its transposed matrix.



2.1 Community Discovery 

In [10], Kleinberg introduced the notions of authority and hub. Let
$G = (V, E)$ be a Web graph, and let $x_i$ and $y_i$ be the authority and hub
weights, respectively, for a node $v_i \in V$. His simple iteration algorithm
updates both weights by

$$x_i \leftarrow \sum_{(v_j, v_i) \in E} y_j, \qquad y_i \leftarrow \sum_{(v_i, v_j) \in E} x_j,$$

and normalizes them at each iteration step. He shows that this iteration
converges.

Theorem 1 (Kleinberg [10]). $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$
converge to the principal eigenvectors of $A(G)^{*}A(G)$ and $A(G)A(G)^{*}$,
respectively, where $A(G)$ is the adjacency matrix for $G$ and the principal
eigenvector is the one which has the largest eigenvalue.

Let $(x_1^{*}, \ldots, x_n^{*})$ be the principal eigenvector. Then, a page
with a large value $x_i^{*}$ is a better page. Kleinberg [10] showed that
$(X^{+}, Y^{+})$ and $(X^{-}, Y^{-})$ form two communities, where
$(X^{+}, Y^{+})$ are the most positive coordinates and $(X^{-}, Y^{-})$ are the
most negative ones in some non-principal eigenvectors of $A(G)^{*}A(G)$ and
$A(G)A(G)^{*}$.
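
The authority and hub iteration described in this subsection can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the function names `hits` and `normalize`, the edge-list representation, and the fixed iteration count are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a weight vector to unit Euclidean length (no-op for a zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else v

def hits(edges, n, iterations=50):
    """Authority/hub iteration on a graph with nodes 0..n-1.

    edges is a list of (i, j) pairs meaning "page i links to page j".
    Returns (authority weights x, hub weights y).
    """
    x = [1.0] * n  # authority weights
    y = [1.0] * n  # hub weights
    for _ in range(iterations):
        # authority of v_j: sum of hub weights of pages pointing to v_j
        new_x = [0.0] * n
        for i, j in edges:
            new_x[j] += y[i]
        # hub of v_i: sum of authority weights of pages that v_i points to
        new_y = [0.0] * n
        for i, j in edges:
            new_y[i] += new_x[j]
        x = normalize(new_x)
        y = normalize(new_y)
    return x, y

# Toy graph (hypothetical): pages 0, 1 and 3 link to page 2; page 0 also links to page 1.
edges = [(0, 2), (1, 2), (3, 2), (0, 1)]
x, y = hits(edges, 4)
```

On this toy graph, page 2 receives the largest authority weight and page 0 the largest hub weight, since page 0 points to both pages that are linked to.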

2.2 Keyword Discovery 

Arimura et al. [1,2] formulated the optimal pattern discovery problem and
developed an efficient algorithm for it. The problem is, given two sets of
documents, to find a pattern that classifies them.

We briefly give a formulation of the problem. Let $S = \{s_1, \ldots, s_n\}$
be a set of documents. An objective condition over $S$ is a binary labeling
function $\xi : S \to \{0, 1\}$. We call a document $s$ a positive document if
$\xi(s) = 1$ and a negative document if $\xi(s) = 0$. Pos and Neg denote the
sets of positive and negative documents, respectively. For a pattern $\pi$ and
a document $s$, we define $\pi(s) = 1$ if $\pi$ matches $s$ and $\pi(s) = 0$
otherwise. Let $S_i = \{s \in S \mid \pi(s) = i\}$ $(i = 0, 1)$. We define

$$G(\pi) = \psi(\theta_0)|S_0| + \psi(\theta_1)|S_1|,$$

where $\theta_i = |S_i \cap Pos| / |S_i|$ $(i = 0, 1)$ and $\psi$ is an
impurity function. In our experiments, we use
$\psi(\theta) = -\theta \log \theta - (1 - \theta) \log(1 - \theta)$.

Definition 1. The optimal pattern discovery problem is, given a set $S$ of
documents and an objective condition $\xi$ over $S$, to find a pattern $\pi$
that minimizes $G(\pi)$.

Although more complex patterns can be used in the above problem, in this
paper we simply use a substring of a Web page as a pattern because it keeps the
computational cost low.
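
To make the objective concrete, the following sketch evaluates the cost of a substring pattern using the entropy impurity, and picks the best substring by brute force. It is an illustration under stated assumptions only: the names `impurity`, `cost` and `best_substring`, the `max_len` bound and the toy documents are hypothetical, and the exhaustive search stands in for, and is far less efficient than, the algorithm of [1,2].

```python
import math

def impurity(theta):
    """Entropy impurity: -theta log(theta) - (1 - theta) log(1 - theta)."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    return -theta * math.log(theta) - (1.0 - theta) * math.log(1.0 - theta)

def cost(pattern, docs, labels):
    """Cost of a substring pattern; labels[i] is 1 for positive documents, 0 for negative."""
    total = 0.0
    for matched in (False, True):
        # labels of the documents that the pattern does not match / matches
        group = [lab for doc, lab in zip(docs, labels) if (pattern in doc) == matched]
        if group:
            theta = sum(group) / len(group)  # fraction of positive documents
            total += impurity(theta) * len(group)
    return total

def best_substring(docs, labels, max_len=8):
    """Brute-force search over all substrings up to max_len characters."""
    candidates = {
        doc[i:i + k]
        for doc in docs
        for k in range(1, max_len + 1)
        for i in range(len(doc) - k + 1)
    }
    return min(sorted(candidates), key=lambda p: cost(p, docs, labels))

# Hypothetical toy documents: two positive, two negative.
docs = ["java intro", "java guide", "os/2 warp", "os/2 driver"]
labels = [1, 1, 0, 0]
best = best_substring(docs, labels)
```

A pattern such as "java" separates the two groups perfectly and therefore has cost zero, while a pattern such as "a", which also matches "os/2 warp", has positive cost.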

3 Experiments

To collect Web pages, we give a query to a search engine and get a set, called
the root set, of the top 200 Web pages. We use AltaVista^1. Then, we enlarge
the root set to the base set according to [10]. For this purpose, we use the
URL database in KN (Knowledge Network) [9]. We apply the following restrictions
when enlarging to the base set: (1) only a URL whose prefix is “http” is used;
(2) a link between pages on the same host is ignored; (3) for a URL in the root
set, if there exist more than 50 pages that point to the URL, then we choose
only 50 pages randomly. In our experiments, we give the query
“+java +introduction” to AltaVista. Then, we enlarge the root set into the base
set of 9,169 Web pages.

3.1 Principal Eigenvector 

We implement the simple iteration algorithm in [10], which outputs the ranked
lists of authorities and hubs. The lists correspond to the principal
eigenvectors.

The following table lists the keywords extracted with the best 98(/100)^2
authorities as positive documents. As negative documents, we use the worst
90(/100) authorities and the best 34(/50) hubs. In general, the set of the
authorities does not form a single community, but we expect that this
experiment shows the power of the keyword extraction algorithm.



negative documents    positive keywords    negative keywords
--------------------  -------------------  -------------------------------------
worst 90 authorities  java, introduction   national, institute, association
best 34 hubs          -                    developer’s, resource, guide, library



In the first case, the algorithm finds positive and negative keywords that
classify the two sets of documents. In particular, the query terms “java” and
“introduction” are discovered from the positive documents. In the second case,
however, the algorithm does not find good positive keywords although it finds
good negative keywords. This is due to the number of negative documents. Still,
the keywords found from the negative documents seem to show a property of hub
pages.

3.2 Non-Principal Eigenvector 

We also implement an algorithm in Fortran that solves the eigensystem.
Table 1 shows the keywords extracted from the two communities of the 2nd
eigenvectors. The positive documents are 200 Web pages: the 100 pages with the
most positive coordinates in the 2nd eigenvector of each of A(G)*A(G) and
A(G)A(G)*. The negative documents are also 200 pages: the 100 pages with the
most negative coordinates in each of them. We see many keywords related to the
operating system name “OS/2 Warp” among the negative keywords.

The two communities of the 2nd eigenvectors form the Web graph in Figure 1.
The nodes on the right-hand side are the positive documents and those on the
left-hand side are the negative ones.

^1 http://www.altavista.com.

^2 We can get only 98 pages from the list of 100 URLs.

ranking  P_i  N_i  extracted keywords
-------  ---  ---  ----------------------
 1        1   53   “os/2”
 2        1   37   “warp”
 3        0   26   “for os/2”
 4        1   27   “os/2 warp”
 5        0   21   “os/2</a>”
 6        1   24   “driver”
 7       19    0   “"#000000"”
 8        0   18   “os/2 user”
 9        0   18   “of os/2”
10        0   18   “device driver”
11        0   17   “freeware”
12        0   16   “unofficial”
13       16    0   “national accelerator”
14        0   16   “developer magazine”
15        0   15   “warped”
16        0   15   “warp 4”
17        0   15   “electronic developer”
18       15    0   “</ul><p>”
19        0   14   “warp</a>”
20        0   14   “user group”

Table 1. Keywords extracted from the 2nd eigenvectors.

Fig. 1. Community extracted from the 2nd eigenvectors.

4 Conclusion and Discussion

We showed some experimental results. We found good characteristic keywords
from two communities without looking at the Web pages in them. We also showed
an experimental result in which different keywords are extracted depending on
the counterpart.

To obtain a complete list of communities, we need to solve the eigensystem of
a 9,169 x 9,169 matrix, which requires too much time and space. Applying linear
algebra techniques [3,4] to reduce the rank of the adjacency matrix for
clustering is left for further research.

We propose nonequilibrium thermodynamics of non-physical systems such
as biological and economical ones^1. We inductively construct it based on
the Auto Regressive type models^2 derived from time series data. Since the
detailed balance is generally broken in such non-physical systems, we extend
the Sekimoto-Sasa theory^3 to systems with the circulation of fluctuations^4.
We apply our arguments to the time series data of nuclear reactor noise^5.

We define the effective potential, U(x, a, c), and the free potential,
F(a, c), from the probability density of the nonequilibrium steady state,

Ps(x,a,c) = exp[F(a, c) - U(x,a,c)] = exp[F(a, c) - x • Gg(a, c)x/2]

where a, c are the coefficient vectors of the ARMA model. While the
reactivity, ρ, and hence the coefficient vectors, vary slowly, the first law is
given as follows:

$$\langle U(t_f) \rangle_{t_f} - \langle U(t_i) \rangle_{t_i} = W + D,$$

$$W \equiv \sum_{t=t_i}^{t_f-1} \int dx\, [U(t+1) - U(t)]\,[P(x,t+1) + P(x,t)]/2,$$

$$D \equiv \sum_{t=t_i}^{t_f-1} \int dx\, [U(t+1) + U(t)]\,[P(x,t+1) - P(x,t)]/2,$$

where $\langle \cdot \rangle_t = \int dx\, \cdot\, P(x,t)$ is an average over
an ensemble of the time series data. We can approximate the average as a time
average over a single time series when the coefficient vectors a(t), c(t)
change quite slowly in time.

^1 H.H. Hasegawa, T. Washio and Y. Ishimiya: Proceedings of the Second
International Conference of Discovery Science, eds. Arikawa and Furukawa,
Springer-Verlag (1999) 326.

^2 H. Akaike: IEEE Transactions on Automatic Control, AC-19 (1974) 716.

^3 K. Sekimoto: J. Phys. Soc. Japan 66 (1997) 1234; K. Sekimoto and S. Sasa:
J. Phys. Soc. Japan 66 (1997) 3326; K. Sekimoto: Prog. Theor. Phys. Suppl. 180
(1998) 17; S. Sasa: private communication.

^4 K. Tomita and H. Tomita: Prog. Theor. Phys. 51 (1974) 1731; K. Tomita,
T. Ohta and H. Tomita: Prog. Theor. Phys. 52 (1974) 1744.

^5 K. Kishida, S. Kanemoto, T. Sekiya and K. Tomita: J. Nuclear Sci. Tech. 13
(1976) 161; K. Kishida, N. Yamada and T. Sekiya: Progress Nuclear Energy 1
(1977) 247.

The work in the quasi-static process, $W_{QS}$, is given as the difference
between the final free potential and the initial one,
$W_{QS} = F(t_f) - F(t_i)$. The work in the near quasi-static process,
$W_{ARMA}$, can be estimated using the ARMA model. We showed that $W_{ARMA}$ is
bounded from below by $W_{QS}$. The ARMA model is not valid in the
nonstationary process, since many degrees of freedom are activated. We can
expect that the work in the nonstationary process, $W_{NS}$, is greater than
$W_{ARMA}$.

We calculated the work from actual time series data of nuclear reactor noise.
We estimated the effective potential using the time averages in the stationary
process, $\langle x_1^2 \rangle_s = \langle x_2^2 \rangle_s$ and
$\langle x_1 x_2 \rangle_s$. We assumed that the matrix G(t) in the potential
depends linearly on time and that the ensemble average can be approximated as
the time average over the single time series. We estimated the work as

$$W_{NS} = [G_{11}(t_f) - G_{11}(t_i)]\,\langle x_1^2 \rangle_{ns} - [G_{12}(t_f) - G_{12}(t_i)]\,\langle x_1 x_2 \rangle_{ns},$$

where $\langle x_1^2 \rangle_{ns}$ and $\langle x_1 x_2 \rangle_{ns}$ are the
time averages over the time series data in the nonstationary process. We
obtained the following results:



ρ           -2.19 → -1.25    -0.588 → -0.289
W_QS        -0.8644          -0.7764
W_ARMA      -0.8639          -0.7733
W_NS        -0.8377          -0.7001
t_f - t_i   41556            14667

These results are consistent with the second law, given as the following
inequality:

$$W_{NS} > W_{ARMA} > W_{QS} = F(t_f) - F(t_i).$$
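
As a numerical sanity check, the sketch below verifies on a toy example that the work and heat-like terms of the first law, with the symmetric discretization used in this paper, add up exactly to the change in the averaged potential, and that the tabulated works respect the ordering above. Everything specific to the toy example is an assumption for illustration: the one-dimensional grid, the Gaussian densities, the linearly varying coefficient `g`, and the helper names. It does not reproduce the reactor-noise analysis.

```python
import math

def gaussian_density(g, xs, dx):
    """Normalized density proportional to exp(-g x^2 / 2) on the grid xs."""
    w = [math.exp(-g * x * x / 2.0) for x in xs]
    z = sum(w) * dx
    return [v / z for v in w]

def first_law_terms(U, P, dx):
    """W and D summed over time steps; U[t][k] and P[t][k] are values on the grid."""
    W = D = 0.0
    for t in range(len(U) - 1):
        for k in range(len(U[t])):
            W += (U[t + 1][k] - U[t][k]) * (P[t + 1][k] + P[t][k]) / 2.0 * dx
            D += (U[t + 1][k] + U[t][k]) * (P[t + 1][k] - P[t][k]) / 2.0 * dx
    return W, D

def mean_potential(U_t, P_t, dx):
    """Average of U over the density P at a single time."""
    return sum(u * p for u, p in zip(U_t, P_t)) * dx

# Toy setting (hypothetical): potential U = g(t) x^2 / 2 with slowly growing g(t).
dx = 0.05
xs = [-5.0 + dx * k for k in range(201)]
gs = [1.0 + 0.1 * t for t in range(6)]
U = [[g * x * x / 2.0 for x in xs] for g in gs]
P = [gaussian_density(g, xs, dx) for g in gs]

W, D = first_law_terms(U, P, dx)
delta_U = mean_potential(U[-1], P[-1], dx) - mean_potential(U[0], P[0], dx)

# Tabulated works from the two measurements: (W_NS, W_ARMA, W_QS).
table = [(-0.8377, -0.8639, -0.8644), (-0.7001, -0.7733, -0.7764)]
ordering_holds = all(w_ns > w_arma > w_qs for w_ns, w_arma, w_qs in table)
```

The identity holds term by term because each summand of W + D reduces to U(t+1)P(x,t+1) - U(t)P(x,t), so the sum over time telescopes.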

In this paper we proposed nonequilibrium thermodynamics of non-physical
systems such as biological and economical ones. Random fluctuations in such
systems generally have nothing to do with thermal ones. We showed that it is
possible to introduce a concept of effective potential and to construct
nonequilibrium thermodynamics. We inductively constructed it based on the Auto
Regressive type models derived from time series data.

We applied our arguments to the actual time series data of nuclear reactor
noise. Since the detailed balance is broken in the nuclear reactor system, we
used the Sekimoto-Sasa theory extended to the circulation of fluctuations.
From the time series data in the stationary process, we defined the effective
potential and the free potential. The effective potential includes the effect
of the circulation of fluctuations as the housekeeping heat. In the
thermodynamics of nonequilibrium steady states, the work in a nonstationary
process is bounded from below by the difference between the final free
potential and the initial one.

We thank Prof. S. Yamada for providing us with the time series data of nuclear
reactor noise. This work is partly supported by a Grant-in-Aid for Scientific
Research on Priority Areas “Discovery Science” from the Ministry of Education,
Science and Culture, Japan.

1 Introduction

A Hidden Markov Network (HMnet) [1] is a kind of statistical finite-state
automaton. The Discrete-type HMnet and the NL-HMnet [2] can be used as a
language model in a speech recognition system. The NL-HMnet has no self-loop
transitions. The algorithms proposed for constructing the Discrete-type HMnet
and the NL-HMnet require the total number of states in the HMnet to be defined
in advance. The optimum number of states can change with various factors such
as the kind of task, the number of training samples, and so on. It is desirable
to determine the optimum number of states automatically for each condition.

The HMnet with the optimum number of states is the one showing the minimum
perplexity for the test samples. In other words, we can choose the HMnet with
the minimum perplexity (equivalently, the maximum likelihood) for the test
samples. If we can estimate the likelihood for the test samples using the
training samples, the HMnet with the optimum number of states can be chosen.

In this paper, we propose an automatic determination algorithm for the optimum
number of states. It can estimate the test set perplexity using only the
training samples, so we can determine the optimum number of states in the
HMnet.



2 The determination algorithm

In order to estimate the test set perplexity using only the training samples,
one idea is to use a part of the training samples as test samples. This idea
was already used in the deleted interpolation algorithm [3], which is one of
the smoothing algorithms for n-grams. In this algorithm, the training samples
are divided into two groups, a larger set and the remaining smaller set; the
n-gram is trained using the larger set, and the test set perplexity is
calculated using the remaining smaller set. These steps are repeated while
changing the training set and the test set, and then the optimum coefficients
over all trials are determined.

If we use this idea to estimate the test set perplexity of the HMnet, we need
to re-estimate the HMnet for each trial. However, as the re-estimation of the
HMnet needs a large amount of calculation time, we cannot employ this idea.
In this paper, we instead propose another estimation algorithm for the test set
perplexity. The new idea is to estimate the occurrence probability of a single
test sample generated from the paths of the HMnet that are unrelated to the
test sample.

In the HMnet, each training sample is assigned to a single path. We calculate
P(x) for the training sample x using Eq. (1).

$$P(x) = \frac{1}{N-1} \sum_{y \neq x} \mathrm{Prob}_{\mathrm{Path}(y)}(x) \qquad (1)$$

where Path(y) denotes the single path to which a sample y is assigned, and
Prob_p(x) denotes the occurrence probability of a sample x generated from the
path p. P(x) can be regarded as the occurrence probability of a single “test”
sample x generated from the HMnet. The entropy H is calculated using Eq. (2).

$$H = -\frac{1}{W} \sum_{x} \log_2 P(x) \qquad (2)$$

where W denotes the total number of words in the training samples. We can
estimate the perplexity using Eq. (3).

$$\mathrm{Perp} = 2^{H} \qquad (3)$$

In order to reduce the calculation time, we use Eq. (4) instead of Eq. (1) for
P(x), because the total number of paths is less than or equal to the total
number of training samples.



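$$P(x) = \frac{1}{N-1} \sum_{\substack{1 \le p \le P \\ p \neq \mathrm{Path}(x)}} \mathrm{Prob}_p(x) \times \mathrm{num}(p) \qquad (4)$$

where num(p) denotes the number of training samples assigned to a path p, and
P denotes the total number of paths.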
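
The last two steps of this estimate, Eqs. (2) and (3), can be sketched in a few lines. The sketch assumes the per-sample probabilities P(x) have already been obtained from the HMnet; the function names and the toy numbers are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def entropy(p_values, total_words):
    """Eq. (2): H = -(1/W) * sum over samples of log2 P(x)."""
    return -sum(math.log2(p) for p in p_values) / total_words

def perplexity(p_values, total_words):
    """Eq. (3): Perp = 2^H."""
    return 2.0 ** entropy(p_values, total_words)

# Toy values (hypothetical): two samples with P(x) = 0.25 and 0.5, three words in total.
p_values = [0.25, 0.5]
total_words = 3
h = entropy(p_values, total_words)
perp = perplexity(p_values, total_words)
```

Here log2 0.25 + log2 0.5 = -3, so H = 1 and the perplexity is 2, i.e. on average the model is as uncertain as a uniform choice between two words.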



3 Experiments 

The training and test samples were randomly generated from a finite-state
automaton. We used two automatons. One describes control commands for an
editing program (Fig. 1); it has 24 states and a vocabulary size of 36. The
other describes airport traffic control commands; it has 64 states and a
vocabulary size of 59. The number of training samples was set to 2000, 5000,
and 10000, and the number of test samples was set to 20000.

The MDL (Minimum Description Length) criterion [4] is frequently used for
choosing the optimum order of a statistical model. We compared the performance
of the proposed algorithm with MDL. MDL is calculated using Eq. (5).

$$\mathrm{MDL} = -l + \frac{a}{2} \log N + \log I \qquad (5)$$

Table 1. Number of states and its perplexity (shown in parentheses) determined
by each method. (Discrete-type HMnet)

task                        # samples  minimum     estimated   MDL
--------------------------  ---------  ----------  ----------  ----------
control command for editor  2000       40 (3.09)   40 (3.09)   31 (3.26)
                            5000       40 (3.01)   40 (3.01)   28 (3.09)
                            10000      50 (3.00)   51 (3.00)   76 (3.02)
airport control command     2000       110 (7.85)  108 (7.85)  32 (12.3)
                            5000       125 (7.72)  126 (7.72)  61 (9.04)
                            10000      140 (7.51)  152 (7.51)  110 (7.78)



where l denotes the log likelihood for the training samples, a denotes the
number of free parameters, and N denotes the number of training samples. I
denotes the number of models and is held constant in these experiments.

Figure 2 shows the test set perplexity and the estimated one for the control
commands for the editing program. The number of training samples was set to
10000 in this experiment. These results show that the test set perplexity was
correctly estimated by the proposed algorithm. The minimum perplexity was
obtained at 50 states for the Discrete-type HMnet and at 25 states for the
NL-HMnet. We can choose the optimum number of states using the estimated
perplexity.

Figure 3 shows the results with MDL under the same conditions. MDL gave 75
states for the Discrete-type HMnet and 40 states for the NL-HMnet. MDL thus
selects larger numbers than the optimum numbers of states.




Fig. 1. Grammar structure of control commands for editing program 

Table 2. Number of states and its perplexity (shown in parentheses) determined
by each method. (NL-HMnet)

task                        # samples  minimum    estimated  MDL
--------------------------  ---------  ---------  ---------  -----------
control command for editor  2000       25 (3.09)  23 (3.09)  26 (3.09)
                            5000       25 (3.02)  23 (3.02)  28 (3.08)
                            10000      25 (3.00)  23 (3.00)  42 (3.23)
airport control command     2000       55 (7.02)  52 (7.19)  41 (7.56)
                            5000       55 (6.84)  55 (6.84)  99 (34.5)
                            10000      55 (6.82)  57 (6.82)  100* (24.5)



Table 1 shows the number of states and its perplexity (shown in parentheses)
determined by each method for the Discrete-type HMnet, and Table 2 shows those
for the NL-HMnet. “minimum” means the minimum perplexity computed using the
test samples, and “estimated” means the minimum perplexity computed by the
proposed algorithm. These tables show that the proposed algorithm determines a
number of states close to the optimum in every condition. On the other hand,
MDL shows a larger difference from the optimum number of states shown in
“minimum”. In Table 1, there is only a small difference between the perplexity
with MDL and the minimum perplexity. However, Table 2 shows a larger difference
between them.




Fig. 2. Test set perplexity and estimated one. 

4 Conclusion

We have proposed a method to determine the optimum number of states in the
Discrete-type HMnet and the NL-HMnet. This algorithm can correctly estimate the
test set perplexity. The experimental results show that it can determine the
optimum number of states even when the MDL criterion does not work.

1 Introduction

Two image processing expert systems, IMPRESS [1] and IMPRESS-Pro [2], have
already been developed by our research group. These systems automatically
generate image processing procedures from sample pairs of an image and a sketch
representing an object to be extracted from it. Automatic acquisition of an
image processing procedure is a kind of knowledge discovery process, because an
image processing procedure which can extract a figure like the given sketch
from the image can be regarded as a representation of knowledge about the
object in the image. In this paper, we compare these two expert systems from
the knowledge discovery viewpoint.

2 Comparison between IMPRESS and IMPRESS-Pro 

In the comparison, we use normal and abnormal images of LSI packages and
sketch figures representing the defective parts of the abnormal images. The
outlines of IMPRESS and IMPRESS-Pro are shown in Fig. 1. The input of IMPRESS
is a set of sample pairs of an abnormal image and its sketch figure, and that
of IMPRESS-Pro is a set of sample images consisting of normal and abnormal
images and sketch figures. Each sketch figure is a binary image. IMPRESS can
generate a procedure to extract the shape of the sketch as precisely as
possible. On the other hand, IMPRESS-Pro can generate a procedure which meets
a given demand on the misclassification rate per image.

Each system has in its database a sequence of local processes in which the
kinds of filters and the parameters are left unfixed. While the sequence of
local processes in IMPRESS consists of
[smoothing-differentiation]-[binarization]-[connected component processing],
that in IMPRESS-Pro consists of
[smoothing-differentiation]-[binarization]-[classification of connected
components]. In each system, filters and parameters are fixed sequentially
while evaluating the realizable performance at each stage. Finally, the systems
generate their own best image processing procedures, each based on its own
criterion for evaluation. The procedures acquired by the two systems are
expected to be different from each other.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 311-314, 2000. 

Fig. 1 Outlines of IMPRESS and IMPRESS-Pro.



3 Experiments 



The image processing procedures acquired by IMPRESS and IMPRESS-Pro from a
sample set (eighteen abnormal and normal images of LSI packages) are shown in
Table 1. The requirements for IMPRESS-Pro are (False Positive rate) < 7% and
(False Negative rate) < 7%. The procedures acquired by the two systems are
different from each other. Fig. 2 and Table 2 show the results of applying the
two procedures in Table 1 to a test set (eighteen abnormal and normal real
images). This result indicates that the procedure acquired by IMPRESS is more
suitable than that acquired by IMPRESS-Pro for extracting the sketch figure
with its shape as precisely as possible. On the other hand, the procedure
acquired by IMPRESS-Pro is more effective than that acquired by IMPRESS in
controlling the misclassification rates.

The criterion used in IMPRESS for selecting the [smoothing-differentiation]
process is the index of separation, calculated by comparing filtered pixel
values in the sample figure with those in the background area. In IMPRESS-Pro,
the number of false positive connected components is used as the criterion.
Fig. 3 shows the joint distribution of the evaluation values calculated by
these criteria for all procedures. The procedures acquired by the two systems
are considered to be basically different, because the correlation between the
values evaluated by the two criteria is low.

[Figure panels: (a-1) original image (normal), (a-2) result (IMPRESS),
(a-3) result (IMPRESS-Pro), (b-1) original image (abnormal), (b-2) sample
figure, (b-3) result (IMPRESS), (b-4) result (IMPRESS-Pro)]

Fig. 2 Experimental results of applying the procedure to the test set.



4 Conclusion 

In this paper, we compared two image processing expert systems from the
viewpoint of automatic knowledge acquisition from images. It was confirmed that
IMPRESS and IMPRESS-Pro generate procedures that differ from each other because
of their different criteria for evaluation. Future research will examine the
effect of search space reduction theoretically.

Table 1 The acquired procedures.

                 IMPRESS                  IMPRESS-Pro
---------------  -----------------------  ---------------------
Smoothing        5x5                      3x3
Differentiation  8-Laplacian (r:41)       8-Laplacian (r:60)
Binarization     22                       36
                 Close-Open,              threshold of
                 Small component          likelihood ratio:
                 elimination: 34 pixels   0.161


Table 2 The performance of acquired procedures.

                       IMPRESS  IMPRESS-Pro
---------------------  -------  -----------
degree of coincidence  64%      47%
False Positive image   33%      0%
False Negative image   6%       17%




Fig. 3 The distribution of evaluation values for all procedures searched by
the system. (Axis label: criteria for IMPRESS, index of separation.)

1 Introduction

Many studies of authorship attribution have dealt with text data in which
boundaries between words are obvious [1][2]. When we apply these studies to
languages in which sentences cannot be divided obviously into words, such as
Japanese or Chinese, preliminary processing of text data such as morphological
analysis is required and may influence the final results. Methods which make
use of characteristics of particular languages or particular compositions also
have limited coverage [3]. Extracting authors’ characteristics from sentences
is generally an unsolved problem. Therefore, we propose a method for authorship
attribution based on the distribution of n-grams of characters in sentences.
The proposed method can analyze sentences without any additional information,
i.e. preliminary analyses. Experiments in which 3-grams representing authors’
characteristics were educed on the basis of their distributions are also
reported in the following.

2 Authors’ Characteristics in N-gram Distribution 

Firstly, we introduce a measure of dissimilarity between two sets of text
data. Let us assume that a small value of this measure suggests a high
probability that a single author wrote the texts which are compared. A string x
of n characters, i.e. an n-gram, is expressed as $x = x_1 x_2 \cdots x_n$,
where $x_i$ is in the alphabet of the target language for $1 \le i \le n$. The
function P(x) represents the probability distribution function of n-grams x in
text P. We take notice of the probability of appearance, not the conditional
probability with which $x_n$ appears after $x_1 x_2 \cdots x_{n-1}$. Where
texts P and Q are given, we define the set C by
$C = \{x \mid P(x)Q(x) \neq 0\}$. C is the set of n-grams which appear in both
text P and text Q, i.e. common n-grams. We measure the resemblance between
text P and text Q with the function dissim defined by

$$\mathrm{dissim}(P, Q) = \frac{1}{\mathrm{card}(C)} \sum_{x \in C} \left| \log \frac{P(x)}{Q(x)} \right|.$$

Here, “card(C)” means the number of elements belonging to the set C. Dissim is
symmetric in P and Q. The value of dissim becomes 0 only if P(x) is the same as
Q(x). Though there are many studies on functions to measure the resemblance
between probability distributions, we take divergence (Kullback-Leibler
information) as a measure for comparison, which is defined as
$D(P \| Q) = \sum_{x} P(x) \log(P(x)/Q(x))$.
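
The dissimilarity measure of this section can be sketched directly from its definition: the mean, over the common n-grams, of the absolute log ratio of their relative frequencies. The code below is a minimal illustration, not the authors' program; the function names and the toy strings are hypothetical, and the character counting rules of Section 3 (tags for characters outside JIS X 0208, handling of blanks and line feeds) are not reproduced.

```python
import math
from collections import Counter

def ngram_distribution(text, n):
    """Relative frequency of each character n-gram in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def dissim(text_p, text_q, n=3):
    """Mean absolute log ratio of n-gram probabilities over the common n-grams."""
    p = ngram_distribution(text_p, n)
    q = ngram_distribution(text_q, n)
    common = p.keys() & q.keys()
    if not common:
        # The paper treats a pair of texts without common n-grams as a fail case.
        raise ValueError("no common n-grams")
    return sum(abs(math.log(p[x] / q[x])) for x in common) / len(common)

# Toy strings (hypothetical): the only common 2-gram is "ab",
# with probability 1/3 in the first text and 2/3 in the second.
d = dissim("aaab", "abab", n=2)
```

For the toy strings the result is |log((1/3)/(2/3))| = log 2. Swapping the two arguments gives the same value, and comparing a text with itself gives 0.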

3 Experiments on Authorship Attribution 

The objects of analysis are 92 works in total, i.e. 72 novels, 9 essays, 5
letters, 3 scenarios, 2 diaries, and 1 speech, written from the Meiji period^1
to the early Shouwa period^2. The following list gives the authors’ names and
the numbers of their works analyzed.

Doppo KUNIKIDA: 11, Hiroshi (Kan) KIKUCHI: 15, Ichiyou HIGUCHI: 7,
Kidou OKAMOTO: 9, Motojirou KAJII: 20, Ryunosuke AKUTAGAWA: 10,
Senko MIZUNO: 7, Takeo ARISHIMA: 13

The total number of characters included in the 92 works is 3,001,650. All the
works were downloaded from the Aozora Bunko^3. Of these, 67 works follow the
modern prescription of notation, i.e. modern kana^4. The others are written in
the old prescription, i.e. historical kana.

A blank and a line feed are each counted as a character. Redundancies, i.e.
lines which do not include any characters and blanks after line feeds, are
ignored. Characters which can be encoded in JIS X 0208 are all counted. Each of
the other characters is replaced with a single tag, which is counted as one
character.

The credibility of dissim declines when the volumes of the texts to be
compared differ greatly. Works are therefore randomly combined until the
volumes of the connected texts reach 30,000 characters. Works longer than
30,000 characters were analyzed without combination. The volume of 30,000
characters was chosen after considering the results of the pre-experiments
reported in [4]. In order to observe the relation between the volumes of texts
and the accuracy of authorship attribution, accuracy was also measured when
texts of from 10,000 to 30,000 characters were compared. Because accuracy might
depend on the way works are combined, 50 random combinations were tried and 50
text-sets to analyze were generated. To discuss which n of the n-gram
distribution is effective, the results of analyses with 1-gram to 10-gram
distributions are reported.

The accuracy of authorship attribution is measured as the percentage of
success cases among all the comparisons in a text-set. The number of
comparisons is equal to the number of ways to choose a text as the origin in a
comparison, i.e. the number of texts to be compared. A success case is defined
as a case where all

^1 The Meiji period is from 1868 to 1912.

^2 The Shouwa period is from 1926 to 1989.

^3 http://www.aozora.gr.jp/main.html
A virtual library on the WWW which mainly provides literature whose copyrights
have expired.

^4 Japanese phonogram
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the values of dissimilarity between the origin text and the same author’s texts 
are smaller than the minimmn dissunilarity between the origin text and dilTerent 
authors’ texts. In a success case, when the dissimilarity of a text T is smaller 
than that of a text of the origin author, i.e. the author of the origin text, we 
can tell that text T is by the origui author. When a text-set has pans of texts 
where any common n-grams are not found, authorship attributions which deal 
with the text-set are coiLsidered as fail cases. 
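The success criterion above can be illustrated with a short Python sketch. It assumes that the pairwise dissimilarities have already been computed and are stored in a nested dictionary; the names `is_success`, `accuracy`, `dissim` and `author` are hypothetical and not taken from the paper, and counting an origin text that has no same-author or no different-author counterpart as a failure is our own assumption.

```python
def is_success(origin, dissim, author):
    """True if every same-author text is closer to `origin` than the
    closest text written by a different author."""
    others = [t for t in author if t != origin]
    same = [dissim[origin][t] for t in others if author[t] == author[origin]]
    diff = [dissim[origin][t] for t in others if author[t] != author[origin]]
    if not same or not diff:
        # Assumption: without both kinds of texts the case counts as a failure.
        return False
    return max(same) < min(diff)


def accuracy(dissim, author):
    """Percentage of success cases when each text in turn is the origin."""
    texts = list(author)
    successes = sum(is_success(origin, dissim, author) for origin in texts)
    return 100.0 * successes / len(texts)
```

Each text serves once as the origin, so the number of comparisons equals the number of texts, as stated above.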

In Figure 1, we can see the arithmetic means of accuracy via dissim and
via divergence through the test on all the 50 text-sets. With dissim, the 3-gram
distribution shows the highest scores, and the maximum value was 96.0%, which
was achieved with 30,000-character-long texts. With no technique equivalent
to smoothing or flooring adopted, the accuracy of divergence is quite low.
Dissim with 2-gram to 4-gram distributions always shows
better results than divergence with any n-gram distribution. Average accuracy
via dissim tends to increase as the volume of the texts to
be compared increases. Therefore, the credibility of dissim is expected to improve
when longer texts are analyzed.




[Figure 1: average accuracy via dissim and via divergence, plotted against the
volume of texts to be compared (characters).]

Fig. 1. Average accuracy of authorship attribution via dissim and via divergence
(The arithmetic means of the accuracy on the 50 text-sets are represented.)



4 Extraction of Authors' Characteristic N-grams

In this section, we try to extract 3-grams which represent the characteristics of
Ryunosuke AKUTAGAWA and Hiroshi KIKUCHI from a set of combined texts
12,500 characters long. The text-set is one of the shortest sets for which perfect
authorship attribution was attained via the 3-gram distribution in the experiments
in Section 3. Because the works analyzed in this section were published in the 11
years from 1917 to 1927, the general change of literary style did not have a great
effect on the differences between these two authors' styles.

The effect of an n-gram x on dissim is defined by share(x) = |log(P(x)/Q(x))|.
N-grams whose distribution is distinctive of an author are supposed to represent
the author's characteristics. When we sorted 3-grams in order of share(x), the
value of share(x) changed drastically around the mode. Though the percentage
of n-grams whose share(x) is less than or equal to the mode was 38.7% on
average (standard variance was 2.63 points), the sum of their share(x) was
only 0.23% of the value of dissim on average (standard variance was 0.14 points).
Therefore, we define characteristic n-grams in a comparison between texts as
n-grams x such that share(x) is greater than the mode.
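As a rough illustration of the share(x) definition, the following Python sketch computes character n-gram distributions for two texts and the share of every n-gram common to both. The function names are hypothetical. How the mode of share(x) is estimated is not specified in this section, so the cut-off is passed in as a plain `threshold` parameter, which is an assumption of this sketch.

```python
import math
from collections import Counter


def ngram_distribution(text, n):
    """Relative frequency of every character n-gram in `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def shares(p, q):
    """share(x) = |log(P(x)/Q(x))| for the n-grams common to P and Q."""
    return {x: abs(math.log(p[x] / q[x])) for x in p.keys() & q.keys()}


def characteristic_ngrams(share, threshold):
    """N-grams whose share exceeds the cut-off (the mode, in the text)."""
    return {x for x, s in share.items() if s > threshold}
```

Only n-grams present in both texts receive a share, since the ratio P(x)/Q(x) is undefined otherwise.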

The sets of characteristic n-grams in comparisons among texts of a single
author represent differences in content, not in style. They are named
content-constrained n-grams and should be eliminated from authors' characteristic
n-grams, which represent authors' characteristics. If the author who uses an
n-gram more frequently changes from case to case, the n-gram should also be
eliminated because it can hardly represent an author's characteristics. N-grams which
appear as characteristic n-grams in only one comparison are ill-founded, so
they should also be eliminated.

Set A is defined as the union of the sets of characteristic n-grams in
comparisons between the two authors. Set B is the union of the sets of
content-constrained n-grams. When the n-grams of A that are not in B are taken as
authors' characteristic 3-grams, 196 kinds of authors' characteristic 3-grams are
extracted. The top 20 kinds of 3-grams with the greatest share(x) are shown in
Table 1. The top 20 kinds of 3-grams whose probabilities of appearance are the
greatest in the sentences of each author are also listed in Table 1. If differences
between the modern prescription of notation and the old one are taken into
consideration, the 12 kinds of 3-grams with the greatest probabilities are common
to the two authors. Furthermore, they consist of parts of function words, which are
supposed to appear frequently in general sentences. 3-grams of greater
probabilities cannot be clues for distinguishing each author's characteristics. On
the other hand, authors' characteristic 3-grams include parts of meaningful words
and do not contain n-grams common to the 40 kinds of 3-grams with the greatest
probabilities. These authors' characteristic 3-grams will contribute not only to the
improvement of authorship attribution methods but also to studies of literature or
stylistics.
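The elimination steps can be sketched with plain set operations. This is a minimal illustration under stated assumptions: `between` and `within` are hypothetical lists holding one set of characteristic n-grams per comparison, "only one comparison" is read as one between-author comparison, and the rule about the more frequent author changing from case to case is left out because it needs the per-comparison frequencies.

```python
from collections import Counter


def authors_characteristic(between, within):
    """Keep n-grams of A (between-author comparisons) that are not in
    B (within-author comparisons) and that occur in at least two
    between-author comparisons."""
    a = set().union(*between)
    b = set().union(*within)
    seen = Counter(x for comparison in between for x in comparison)
    return {x for x in a - b if seen[x] > 1}
```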



Table 1. Authors' characteristic 3-grams and 3-grams of the greatest appearance
probabilities (“©n” means a carriage return.)

[Table 1: column heading "Authors' Characteristic 3-grams"; the table body is
missing from this text.]

5 Conclusion and Future Work

Dissim achieved an average accuracy of 96.0% at highest via the 3-gram distribution.
Dissim could extract authors' characteristics common to works in various genres.
Divergence was not effective in these experiments. We suppose that dissim
achieved high accuracy because (1) dissim is normalized by the number of common
n-grams, and (2) absolute values of the log-ratios of P(x) to Q(x) are added in dissim.

Our study on extraction of authors' fingerprints is at the stage of examining
the validity of the proposed method. Texts by more authors, in more languages
and of more kinds should be included in the experiments. More random
combinations for connecting compositions should also be tried. Thereafter, we will
be able to answer the question of why dissim represented the difference
between authors' styles better than divergence. These conclusions will be
meaningful clues when we consider what characteristics are desirable for
similarity measures among probability distributions in the area of natural language
processing. The final aim of our study is the automatic discovery of methods for
grasping characteristics from general or particular sentences.

Abstract. With the variety of human life, people are interested in various
matters, each for his or her own unique reason, for which a machine may be
a better counselor than a human. This paper proposes to help the user create
novel knowledge by combining multiple existing documents, even if
the document collection is sparse, i.e. if a query in the domain has no
corresponding answer in the collection. This novel knowledge realizes an
answer to a user's unique question, which cannot be answered by a single
recorded document. In the Combination Retriever implemented here,
cost-based abduction is employed for selecting and combining appropriate
documents for making a readable and context-reflecting answer. Empirically,
Combination Retriever obtained satisfactory answers to users'
unique questions.



1 Introduction 

People are interested in personal and unique matters, e.g. a very rare health
condition, friction with friends, etc. They often hesitate to consult a human about
such unique matters, and worry in their own minds. In such a case, entering such
interests into a search engine and reading the output documents is a convenient
way that may provide satisfactory information.

However, the document collection of a search engine, even though it may
seem to include a lot of documents, is too sparse for answering a unique question:
it holds only past information, which is not satisfactory for answering novel
queries. To overcome this situation, a search engine should help the user create
knowledge from sparse documents.

For this purpose, we propose a novel information retrieval method named
combination retrieval. The basic idea is that an appropriate combination of
existing documents may lead to creating novel knowledge, although each single
document may fall short of answering the novel query. Based on the principle
that combining ideas triggers the creation of new ideas [1], we present a system
to obtain and present an optimal combination of documents to the user,
optimal in that the solution forms a document-set which is the most readable
(understandable) and best reflects the user's context.

* e-mail: matumura@miv.t.u-tokyo.ac.jp

The remainder of this paper is organized as follows: In Section 2, the mechanism of
the implemented system Combination Retriever is described. We show the
experiments and the results in Section 3, showing the performance of Combination
Retriever for medical counseling question-and-answer documents.



2 The Process of Combination Retriever 

For realizing combination retrieval, we need a method for selecting meaningful 
documents which, as a set, serve a good (readable and meaningful) answer to the 
user. Here we show our approach implemented as a system called Combination 
Retriever, where abductive inference is used for the selection of documents to be 
combined. The process of Combination Retriever is as follows: 

The process of Combination Retriever 
Step 1) Accept the user's query Qg.

Step 2) Obtain G, a word-set representing the goal the user wants to understand,
from Qg (G = Qg if Qg is given simply as a word-set).

Step 3) Make knowledge-base E for the abduction of Step 4). For each document
Dx in the document-collection Cdoc, a Horn clause is made to describe
the condition for reading Dx (words that need to be understood for reading Dx)
and the effect of reading Dx (words to be subsequently understood by reading
Dx).

Step 4) Obtain h, the optimal hypothesis-set which derives G when combined
with E, by cost-based abduction (CBA, hereafter) [2]. The h obtained here
represents the union of the following information, with K of the least size.

S: The document-set the user should read.

K: The keyword-set the user should understand from information sources other
than the document collection Cdoc, for reading the documents in S.

Step 5) Show the documents in S to the user.

The intuitive meaning of the abductive inference is to obtain the conditions 
for understanding user’s goal G. Those conditions include the documents to 
read (S) for understanding G, and necessary knowledge (K) for reading those 
documents. That is, S means the document-combination we aim to present to 
the user. 



2.1 An Example of Combination Retriever’s Execution 

For example, Combination Retriever runs as follows.

Step 1) Qg = “Does alcohol cause a liver cancer ?” 

Step 2) G is obtained from Qg as {alcohol, liver, cancer}. 







Naohiro Matsumura and Yukio Ohsawa 



Step 3) From Cdoc, documents D1, D2, and D3 are taken, each including terms
in G, and put into Horn clauses as:

alcohol :- cirrhosis, cell, disease, D1.
liver :- cirrhosis, cell, disease, D1.
alcohol :- marijuana, drug, health, D2.
liver :- marijuana, drug, health, D2.
alcohol :- cell, disease, organ, D3.
cancer :- cell, disease, organ, D3.

Hypothesis-set H is formed of the conditional parts here, i.e. D1, D2 and D3
of Type 1 ¹, each weighted 0, and “cirrhosis,” “cell,” “disease,” “marijuana,”
“drug,” “health,” and “organ” of Type 2 ², each weighted 1.

Step 4) h is obtained as S ∪ K, where

S = {D1, D3} and
K = {cirrhosis, cell, disease, organ},

meaning that the user should understand “cirrhosis,” “cell,” “disease” and
“organ” for reading D1 and D3, served as the answer to Qg. This solution is
selected because cost(h) takes the value of 4, less than the 6 of the only
alternative feasible solution, i.e. {marijuana, drug, health, cell, disease, organ}
plus {D2, D3}.

Step 5) The user now reads the two documents presented as:

D1 (including alcohol and liver) stating that alcohol alters the liver function
by changing liver cells into cirrhosis.

D3 (including alcohol and cancer) showing the causes of cancer in various
organs, including a lot of alcohol. This document recommends drinkers
to limit to one ounce of pure alcohol per day.

As a result, the subject learns that he should limit drinking to keep the liver
healthy and avoid cancer, and also comes to know that tissues other than the
liver get cancer from alcohol.

Thus, the user can understand the answer by learning a small number of words
from outside of Cdoc, as we aimed in employing CBA. More important than
this major effect of Combination Retriever is a by-product: the common
hypotheses between D1 and D3, i.e., {cell, disease} of Type 2, are discovered as
the context of the user's interest underlying the entered words. This effect is due
to CBA, which obtains the smallest number of involved contexts for explaining the
goal (i.e. answering the query) as solution hypotheses. Presenting such a novel
and meaningful context to the user induces the user to create new knowledge
[5], to satisfy his/her novel interest.
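The example of this section can be replayed with a few lines of Python. The sketch below is not the cost-based abduction procedure of [2]; it simply enumerates every subset of the three documents, which is feasible only for such a tiny collection. The names `DOCS`, `feasible_solutions` and `best_solution` are hypothetical.

```python
from itertools import combinations

# Horn clauses of Step 3: document -> (effect words, condition words).
DOCS = {
    "D1": ({"alcohol", "liver"}, {"cirrhosis", "cell", "disease"}),
    "D2": ({"alcohol", "liver"}, {"marijuana", "drug", "health"}),
    "D3": ({"alcohol", "cancer"}, {"cell", "disease", "organ"}),
}


def feasible_solutions(goal, docs):
    """Yield (cost, S, K) for every document subset S that derives the goal."""
    names = sorted(docs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            covered = set().union(*(docs[d][0] for d in subset))
            if goal <= covered:
                needed = set().union(*(docs[d][1] for d in subset))
                # Documents (Type 1) weigh 0 and terms (Type 2) weigh 1,
                # so the cost is the number of distinct condition words.
                yield len(needed), set(subset), needed


def best_solution(goal, docs):
    """Return the minimal-cost (cost, S, K) by exhaustive enumeration."""
    return min(feasible_solutions(goal, docs), key=lambda s: s[0])
```

For the goal {alcohol, liver, cancer} this returns S = {D1, D3} with cost 4, while {D2, D3} costs 6; the enumeration also lists the redundant set of all three documents, which costs 7. The shared condition words cell and disease are counted once, which is why combining D1 with D3 is cheaper.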

¹ Hypothesis that the user reads a document in Cdoc.

² Hypothesis that the user knows (learns) a conditional term in Cdoc.

3 Experimental evaluations

3.1 The experimental conditions 

Combination Retriever was applied to a Cdoc of 1320 question-answer pairs from a
health care question answering service on the WWW (Alice, http://www.alice.columbia.edu).
Past clients of Alice asked about personal anxieties or interests in health, and a
medical counselor answered them.

0323 (The question to this answer is here)

Dear ACOA,

Alcohol, even in relatively small amounts, can alter liver function. With continued use of alcohol, liver
cells are damaged and then progressively destroyed. The destroyed cells are often replaced by
fibrous scar tissue, a condition known as cirrhosis of the liver. As cirrhosis develops, the individual
may progressively lose his or her capacity to tolerate alcohol, because there are fewer remaining liver
cells to metabolize whatever alcohol is in the bloodstream. Cirrhosis has the potential to be a fatal
disease. For more information, ask your or your grandmother's doctor, or call Adult Children of
Alcoholics (ACOA) at (212) 316-391C.

Alice

0613 (The question to this answer is here)

Dear Worries about cancer and diet,

Cancer is not one disease, but is actually a group of diseases caused by the unrestrained growth of
cells in one of the body's organs or tissues. Which people get cancer at what times, in which organs,
is still somewhat of a mystery. One factor that increases a person's risk of contracting cancer is
genetic makeup. Environmental triggers (i.e. food choices, sunlight, alcohol, viruses, tar in tobacco
smoke, pollutants in the air) also play a part in cancerous formations. Although it is difficult to
estimate which of these triggers cause cancer in susceptible individuals, estimates have been made.

Fig. 1. An output of Combination Retriever, showing two past answers 0323 and 0613
(document IDs in Cdoc) for input query {alcohol, cancer, liver}.



3.2 Result Statistics 

The test was carried out with 5 subjects from 21 to 30 years old who were
accustomed to using a Web browser. This means that the subjects were close in age
to the past question askers of Alice, and could enter queries into the CGI interface
and read the answers of Combination Retriever smoothly.

Here, 63 queries were entered. This seems to be quite a small number for 
the evaluation data. However, we compromised with this size of data for two 
reasons. First, we aimed at having each subject evaluate the returned answer in 
a natural manner. That is, in order to have the subject report whether he/she 
was really satisfied with the output of Combination Retriever, the subject must 
enter his/her real anxiety or interest. Otherwise, the subject has to imagine an 
unreal person who asks the query and imagine what the unreal person feels 
with the returned answers. Therefore we restricted to a small number of queries 
entered from real interests. Second, the results show significant superiority of 
Combination Retriever as follows, even though the test data was small. 

The overall result was that Combination Retriever satisfied 43 of the 63 
queries, while VFAQ satisfied only 26 queries. Next, let us show more detailed re- 
sults, which show that Combination Retriever works especially for novel queries. 



4 Conclusions 

We proposed to help the user create novel knowledge by combining and presenting
multiple existing documents. This novel knowledge realizes an answer to the user's
unique question, which cannot be answered by a single document.

This high performance comes from obtaining the minimal-cost hypothesis in CBA.
That is, a document-set in a meaningful context can be obtained, because CBA
discovers the relevant context according to the user's query by minimizing the
number of conditional terms for reading the output documents. This means that the
user and the system can ask and answer under a meaningful context, which supports
meaningful communication. From such a novel and meaningful context,
the user can create new knowledge which satisfies
his/her unique interest. This is a significant by-product of minimizing the cost
of output documents for obtaining an answer that is easy to read.

1 Introduction

The discovery of a numeric law, e.g., Kepler's third law $T = kr^{3/2}$, from data
is the central part of scientific discovery systems. The data considered in many
real fields usually contain both nominal and numeric values. Thus, we consider
discovering a law that governs such data in the form of a rule set of nominally
conditioned polynomials. Recently, a connectionist method called RF6 [1] was
proposed to solve problems of this type. RF6 can learn multiple nominally
conditioned polynomials with a single neural network; besides, RF6 can discover
generalized polynomials whose power values are not restricted to integers. However,
for real complex problems, RF6 will suffer from a combinatorial explosion in the
process of restoring rules from a trained neural network. Therefore, this paper
proposes a new version of RF6 by greatly improving its procedure of restoring
nominally conditioned polynomials from a trained neural network.

2 Restoring Nominally Conditioned Polynomials 

Let $\{q_1, \cdots, q_{K_1}, x_1, \cdots, x_{K_2}, y\}$ be a set of variables describing an example,
where $q_k$ is a nominal explanatory variable, $x_k$ is a numeric explanatory variable
and $y$ is a numeric target variable. For each $q_k$ we introduce a dummy variable
expressed by $q_{kl}$, i.e., $q_{kl} = 1$ if $q_k$ matches the $l$-th category; otherwise 0, where
$l = 1, \cdots, L_k$, and $L_k$ is the number of distinct categories appearing in $q_k$. As a
true model governing data, we consider the following set of multiple nominally
conditioned polynomials whose power values are not restricted to integers.

$$\text{if } \bigwedge_{q_{kl} \in Q^i} q_{kl} = 1 \text{ then } y(x; \Theta^i) = w_0^i + \sum_{j=1}^{J^i} w_j^i \prod_{k=1}^{K_2} x_k^{w_{jk}^i}, \quad i = 1, \cdots, I^* \qquad (1)$$

where $I^*$ is the number of rules, $Q^i$ denotes a set of dummy variables
corresponding to the $i$-th nominal condition and $\Theta^i$ is a parameter vector used in the
$i$-th generalized polynomial. Here, each parameter $w_j^i$ or $w_{jk}^i$ is a real number,
and $J^i$ is an integer corresponding to the number of terms.

S. Arikawa and S. Morishita (Eds.): DS 2000, LNAI 1967, pp. 325-329, 2000.

Consider a function $c(q; V) = \exp\left(\sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{kl} q_{kl}\right)$, where $V$ denotes a
vector of parameters $v_{kl}$. We can show [1] that with an adequate number $J$, a
certain type of neural network
$y(q, x; \Theta) = w_0 + \sum_{j=1}^{J} w_j \exp\left(\sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{jkl} q_{kl} + \sum_{k=1}^{K_2} w_{jk} \ln x_k\right)$
can closely approximate Eq. (1). Let $D = \{(q^\mu, x^\mu, y^\mu) : \mu = 1, \cdots, N\}$ be a set of
training data, where $N$ is the number of examples. Then,
each parameter can be estimated by minimizing an objective function
$E(\Theta) = \sum_{\mu=1}^{N} (y^\mu - y(q^\mu, x^\mu; \Theta))^2 + \Theta^T \Lambda \Theta$, where a penalty term is added to improve
both the generalization performance and the readability of the learning results.
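The link between the network and the polynomial form can be checked numerically. The Python sketch below evaluates the network output and the equivalent expression in which each hidden unit contributes a coefficient $c_j = w_j \exp(\sum v_{jkl} q_{kl})$ times a product of powers of the numeric variables. It is only an illustration: the function names are hypothetical, the dummy variables are assumed to be flattened into one 0/1 list `q`, the exponents $w_{jk}$ are passed as `u`, and every `x` value is assumed positive so that the logarithm exists.

```python
import math


def network_output(q, x, w0, w, v, u):
    """w0 + sum_j w[j] * exp(sum_m v[j][m]*q[m] + sum_k u[j][k]*ln x[k])."""
    total = w0
    for j in range(len(w)):
        net = sum(vjm * qm for vjm, qm in zip(v[j], q))
        net += sum(ujk * math.log(xk) for ujk, xk in zip(u[j], x))
        total += w[j] * math.exp(net)
    return total


def polynomial_output(q, x, w0, w, v, u):
    """Same value written as w0 + sum_j c_j * prod_k x[k] ** u[j][k]."""
    total = w0
    for j in range(len(w)):
        c_j = w[j] * math.exp(sum(vjm * qm for vjm, qm in zip(v[j], q)))
        total += c_j * math.prod(xk ** ujk for ujk, xk in zip(u[j], x))
    return total
```

Both functions agree up to floating-point rounding, because $\exp(w_{jk} \ln x_k) = x_k^{w_{jk}}$; the nominal variables only rescale the coefficient of each term.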
2.1 Restoring Procedures 

Assume that we have already obtained a neural network trained as the best
law-candidate. In order to restore a set of nominally conditioned polynomials as
described in Eq. (1), we need a suitable efficient procedure.

RF6 [1] has a decomposition procedure for this purpose; i.e., a set of nominally
conditioned terms is extracted from each hidden unit, and then each of
these terms is in turn combined through all of the hidden units. When $a$
denotes the average number of terms over each hidden unit, the total number of
these combined terms approximately amounts to $a^J$. Thus, as the number of
hidden units or the number of nominal variables increases, this procedure comes
to suffer from a combinatorial explosion.

As another approach, we can extract nominally conditioned polynomials for
each training example, and simply assemble them to obtain a final set of rules
as a law. Then, the following set of nominally conditioned polynomials can be
obtained directly from the training data and the trained neural network.

$$\text{if } \bigwedge_{\{q_{kl} : q_{kl}^\mu = 1\}} q_{kl} = 1 \text{ then } y = w_0 + \sum_{j=1}^{J} c_j^\mu \prod_{k=1}^{K_2} x_k^{w_{jk}}, \quad \mu = 1, \cdots, N \qquad (2)$$

where $c_j^\mu$ denotes the $j$-th coefficient calculated from the nominal values of the
$\mu$-th training example, i.e., $c_j^\mu = w_j \exp\left(\sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{jkl} q_{kl}^\mu\right)$. However, in
comparison with the true model governing the data defined in Eq. (1), the results of
this naive procedure can be far from desirable because they will contain a large
number of similar polynomials, and each nominal condition will be too specific
in terms of representing only one training example.

Based on the above considerations, we propose a new restoring procedure.

step 1. finding subspace representatives

In order to find subspace representatives, a set of coefficient vectors
$\{c^\mu = (c_1^\mu, \cdots, c_J^\mu) : \mu = 1, \cdots, N\}$ calculated from the training data is quantized
into a set of representative vectors $\{r^i = (r_1^i, \cdots, r_J^i) : i = 1, \cdots, I\}$, where
$I$ is the number of representatives. Among several vector quantization (VQ)
algorithms, we employ the k-means algorithm due to its simplicity.

step 2. criterion for model selection

Consider the following set of rules using the representative vectors.

$$\text{if } i(q) = i \text{ then } y = w_0 + \sum_{j=1}^{J} r_j^i \prod_{k=1}^{K_2} x_k^{w_{jk}}, \quad i = 1, \cdots, I \qquad (3)$$

where $i(q)$ denotes a function that returns the index of the representative
vector minimizing the distance, i.e., $i(q) = \arg\min_i \sum_{j=1}^{J} (c_j - r_j^i)^2$. Here, since
each element of $c$ is calculated as $c_j = w_j \exp\left(\sum_{k=1}^{K_1} \sum_{l=1}^{L_k} v_{jkl} q_{kl}\right)$, Eq. (3)
can be applied to a new example, as well as the training examples. Thus,
to determine an adequate number $I$ of representatives, we employ the
procedure of cross-validation which divides the data $D$ at random into $S$
distinct segments $\{D_s : s = 1, \cdots, S\}$. Namely, by using the final weights $\Theta^{(s)}$
trained without data segment $D_s$, we can define a cross-validation error
function
$CV = N^{-1} \sum_{s=1}^{S} \sum_{\nu \in D_s} \left(y^\nu - \left(w_0^{(s)} + \sum_{j=1}^{J} r_j^{i(q^\nu)} \prod_{k=1}^{K_2} (x_k^\nu)^{w_{jk}^{(s)}}\right)\right)^2$.

step 3. generating conditional parts

The indexing functions $\{i(q)\}$ described in Eq. (3) must be transformed into a
set of nominal conditions as described in Eq. (1). One reasonable approach is to
perform this transformation by solving a classification problem whose training
examples are $\{(q^\mu, i(q^\mu)) : \mu = 1, \cdots, N\}$, where $i(q^\mu)$ indicates the class label
of a training example $q^\mu$. For this classification problem, we employ the c4.5
decision tree generation program due to its wide availability. From the generated
decision tree, we can easily obtain the final rule set as described in Eq. (1).

Clearly, these steps can be executed within the computational complexity of
linear order with respect to the numbers of training examples, variables, hidden
units, representatives, iterations performed by the k-means algorithm, and data
segments used concerning cross-validation; i.e., this new restoring procedure can
be much more efficient than the old decomposition procedure which requires
the computational complexity of exponential order. Hereafter, the law discovery
method using the above restoring procedure is called RF6.2.
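Step 1 and the indexing function of step 2 can be sketched in a few lines of plain Python. This is a minimal k-means on the coefficient vectors, written for illustration only; the function names, the fixed iteration cap and the handling of an empty cluster (its representative is left unchanged) are assumptions, not details given in the text.

```python
def nearest(c, reps):
    """Index i of the representative minimizing sum_j (c_j - r_j^i)^2."""
    return min(range(len(reps)),
               key=lambda i: sum((cj - rj) ** 2 for cj, rj in zip(c, reps[i])))


def kmeans(vectors, initial_reps, max_iterations=100):
    """Quantize coefficient vectors into len(initial_reps) representatives."""
    reps = [list(r) for r in initial_reps]
    for _ in range(max_iterations):
        groups = [[] for _ in reps]
        for c in vectors:
            groups[nearest(c, reps)].append(c)
        # Each representative moves to the mean of the vectors assigned to it.
        new_reps = [[sum(col) / len(g) for col in zip(*g)] if g else reps[i]
                    for i, g in enumerate(groups)]
        if new_reps == reps:
            break
        reps = new_reps
    return reps
```

Once the representatives are fixed, `nearest` plays the role of $i(q)$: the coefficient vector computed from the nominal values of any example, seen or unseen, is mapped to the index of its closest representative. Each pass over the data is linear in the number of examples and representatives, in line with the complexity remark above.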

2.2 Evaluation by Experiments 

By using three data sets, we evaluated the performance of RF6.2. In the
k-means algorithm, initial representative vectors $\{r^i\}$ are randomly selected as a
subset of coefficient vectors $\{c^\mu\}$. For each $I$, trials are repeated 100 times with
different initial values, and the best result is reported. The cross-validation error
is calculated by using the leave-one-out method, i.e., $S = N$. The candidate
number $I$ of representative vectors is incremented in turn from 1 until the
cross-validation error increases. The c4.5 program is used with the initial settings.
Artificial data set.

We consider an artificial law described by

$$\begin{array}{ll}
\text{if } q_{21} = 1 \wedge (q_{31} = 1 \vee q_{33} = 1) & \text{then } y = 2 + 3x_1^{-1}x_2^{3} + 4x_3x_4^{1/2}x_5^{-1/3} \\
\text{if } q_{21} = 0 \wedge (q_{32} = 1 \vee q_{34} = 1) & \text{then } y = 2 + 5x_1^{-1}x_2^{3} + 2x_3x_4^{1/2}x_5^{-1/3} \\
\text{else} & \text{then } y = 2 + 4x_1^{-1}x_2^{3} + 3x_3x_4^{1/2}x_5^{-1/3}
\end{array} \qquad (4)$$

where we have three nominal and nine numeric explanatory variables, and the
numbers of categories of $q_1$, $q_2$ and $q_3$ are set as $L_1 = 2$, $L_2 = 3$ and $L_3 = 4$,
respectively. Clearly, variables $q_1, x_6, \cdots, x_9$ are irrelevant to Eq. (4). Each value
of nominal variables $q_1, q_2, q_3$ is randomly generated so that only one dummy
variable becomes 1, each value of numeric variables $x_1, \cdots, x_9$ is randomly
generated in the range of (0, 1), and we get the corresponding value of $y$ by calculating
Eq. (4) and adding Gaussian noise with a mean of 0 and a standard deviation
of 0.1. The number of examples is set to 400.

In this experiment, a neural network was trained by setting the number of
hidden units $J$ to 2. We examined the performance of the experimental results
obtained by applying the k-means algorithm with different numbers of
representative vectors, where the RMSE (root mean squared error) was used for
the evaluation; the training error was evaluated as a rule set by using Eq. (3);
the cross-validation error was calculated by using the function CV; and the
generalization error was also evaluated as a rule set and measured by using a set
of 10,000 noise-free test examples generated independently of the training
examples. The experimental results showed that the training error almost
monotonically decreased (2.090, 0.828, 0.142, and 0.142 for $I$ = 1, 2, 3, and 4,
respectively); the cross-validation error was minimized when $I = 3$ (2.097, 0.841,
0.156, and 0.160 for $I$ = 1, 2, 3, and 4, respectively, i.e., indicating that an
adequate number of representative vectors is 3); and the generalization error was
also minimized when $I = 3$ (2.814, 1.437, 0.320, and 0.322 for $I$ = 1, 2, 3, and
4, respectively). Since the cross-validation and generalization errors were
minimized with the same number of representative vectors, we can consequently see
that the desirable model was selected by using the cross-validation.
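The stopping rule used for model selection (increase the number of representatives until the cross-validation error goes up) is easy to replay on the reported figures. The function name below is hypothetical, and the errors are assumed to be listed for $I = 1, 2, \ldots$ in order.

```python
def select_num_representatives(cv_errors):
    """Return the last I reached before the cross-validation error increases.

    cv_errors[0] is the error for I = 1, cv_errors[1] for I = 2, and so on.
    """
    best = 0
    for idx in range(1, len(cv_errors)):
        if cv_errors[idx] > cv_errors[idx - 1]:
            break
        best = idx
    return best + 1


# Cross-validation errors reported above for I = 1, 2, 3, 4.
reported = [2.097, 0.841, 0.156, 0.160]
chosen = select_num_representatives(reported)
```

With the reported errors the search stops at $I = 4$, where the error rises from 0.156 to 0.160, and `chosen` is 3.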

By applying the c4.5 program, we obtained the following decision tree, whose
leaf nodes correspond to the representative vectors shown on the right.

    q21 = 0:
    |   q34 = 1: 2 (83.0)          =>  r^2 = (+5.04, +2.13)
    |   q34 = 0:
    |   |   q32 = 0: 3 (129.0)     =>  r^3 = (+3.96, +2.97)
    |   |   q32 = 1: 2 (53.0)      =>  r^2 = (+5.04, +2.13)
    q21 = 1:
    |   q34 = 1: 3 (36.0)          =>  r^3 = (+3.96, +2.97)
    |   q34 = 0:
    |   |   q32 = 0: 1 (73.0)      =>  r^1 = (+3.10, +4.07)
    |   |   q32 = 1: 3 (26.0)      =>  r^3 = (+3.96, +2.97)

where the coefficient values were rounded off to the second decimal place; the
number of training examples arriving at each leaf node is shown in
parentheses. Then, the following rule set was straightforwardly obtained.

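$$\begin{array}{ll}
\text{if } q_{21} = 1 \wedge (q_{31} = 1 \vee q_{33} = 1) & \text{then } y = 2.01 + 3.10\, x_1^{[\ldots]} x_2^{[\ldots]} + 4.07\, [\ldots] \\
\text{if } q_{21} = 0 \wedge (q_{32} = 1 \vee q_{34} = 1) & \text{then } y = 2.01 + 5.04\, x_1^{[\ldots]} x_2^{[\ldots]} + 2.13\, [\ldots] \\
\text{else} & \text{then } y = 2.01 + 3.96\, x_1^{[\ldots]} x_2^{[\ldots]} + 2.97\, [\ldots]
\end{array} \qquad (5)$$

Recall that each nominal variable matches only one category, e.g.,
$(q_{32} = 1 \wedge q_{34} = 0) \equiv (q_{32} = 1)$. Therefore, although some of the weight values were slightly
different, we can see that a law almost equivalent to the true one was found.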
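That the decision tree and the rule conditions describe the same partition can be verified by enumerating all category combinations of $q_2$ and $q_3$. The sketch below encodes the tree as read from the text above and the three rule conditions; both encodings are our transcription, and the function names are hypothetical.

```python
def tree_class(q2, q3):
    """Leaf label of the decision tree for categories q2 in 1..3, q3 in 1..4."""
    q21, q32, q34 = int(q2 == 1), int(q3 == 2), int(q3 == 4)
    if q21 == 0:
        if q34 == 1:
            return 2
        return 2 if q32 == 1 else 3
    if q34 == 1:
        return 3
    return 3 if q32 == 1 else 1


def rule_class(q2, q3):
    """Index of the representative vector selected by the rule conditions."""
    if q2 == 1 and q3 in (1, 3):
        return 1
    if q2 != 1 and q3 in (2, 4):
        return 2
    return 3


# The two descriptions agree on every combination of categories.
agree = all(tree_class(a, b) == rule_class(a, b)
            for a in (1, 2, 3) for b in (1, 2, 3, 4))
```

The check relies on the fact recalled above: since $q_3$ takes exactly one of its four categories, $q_{32} = 0 \wedge q_{34} = 0$ is the same condition as $q_{31} = 1 \vee q_{33} = 1$.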

Financial data set.

We performed an experimental study to discover underlying laws of market
capitalization from six fundamental BS (Balance Sheet) items and the type of
industry (the 33 classifications of the Tokyo Stock Exchange). Our experiments used
data from 953 companies listed on the first section of the TSE, where banks,
insurance, securities and recently listed companies were excluded. In order
to understand the effect of the nominal variable intuitively, the number of hidden
units was fixed at 1. The cross-validation error was minimized at $I = 3$. Then,
the following rule set was obtained.

$$\text{if } \bigvee_{q_l \in Q^i} q_l = 1 \text{ then } y = 12891.6 + r^i\, [\ldots], \quad i = 1, 2, 3 \qquad (6)$$

where $r^1 = +1.907$, $r^2 = +1.122$, $r^3 = +0.657$ and each of the nominal
conditions was as follows: $Q^1$ = {“Pharmaceuticals”, “Rubber Products”, “Metal
Products”, “Machinery”, “Electrical Machinery”, “Transport Equipment”,
“Precision Instruments”, “Other Products”, “Communications”, “Services”}; $Q^2$ =
{“Foods”, “Textiles”, “Pulp & Paper”, “Chemicals”, “Glass & Ceramics”,
“Nonferrous Metals”, “Maritime Transport”, “Retail Trade”}; and $Q^3$ = {“Fisheries”,
“Mining”, “Construction”, “Oil & Coal Products”, “Iron & Steel”,
“Electricity & Gas”, “Land Transport”, “Air Transport”, “Warehousing”, “Wholesale”,
“Other Financing Business”, “Real Estate”}. Since the second term on the right
hand side of the polynomials appearing in Eq. (6) is always positive, each of
the coefficient values $r^i$ can indicate the stock price setting tendency of industry
groups in similar BS situations, i.e., the discovered law tells us that industries
appearing in $Q^1$ are likely to have a high setting, while those in $Q^3$ are likely to
have a low setting.

Automobile data set. 

The Automobile data set contained data on the car and truck specifications in
1985, and was used to predict prices based on these specifications. The data set
had 159 examples with no missing values, and consisted of 10 nominal and 14
numeric explanatory variables and one target variable (price). In this experiment,
since the number of examples was small, the number of hidden units was also set
to 1. The cross-validation error was minimized at $I = 3$. The polynomial part of
the discovered law was as follows:

y = 1163.16 + r*X+'®384°046^-1.436^+0.997^-0.245^-a (7) 

where $r^1 = +1.453$, $r^2 = +1.038$, $r^3 = +0.763$ and relatively simple nominal
conditions were obtained. As described for the experiments using the
financial data set, since the second term on the right hand side of Eq. (7) is always
positive, the coefficient value $r^i$ can indicate the car price setting tendency for
similar specifications. Actually, the discovered law told us that cars with
a high price setting are: “5-cylinder ones”, “BMW’s”, “convertibles”, “VOLVO
turbos”, “SAAB turbos”, and “6-cylinder turbos”; cars with a middle price setting
are: “PEUGOT’s”, “VOLVO non-turbos”, “SAAB non-turbos”, “HONDA
1bbl-fuel-system ones”, “MAZDA fair-risk-level ones”, “non-BMW non-turbos &
6-cylinder ones”, “non-5-cylinder turbos & fair-risk-level ones”; and other cars are
of a low price setting.